Converting a standard RegEx to one in SAS

Despite reading the SAS documentation and various example pages, I am struggling to convert a slightly more complicated RegEx to SAS syntax. I am using the prxchange function. This is what I have come up with so far to convert a filename string like pre_31DEC2019_299792458.xls to an integer number (of length 8) 299792458 inside a SAS data step:
tmp=prxchange('s/pre_([a-zA-Z0-9]{8,9})_([0-9]{1,16})\.xls/\2/g',-1,have);
want=input(tmp,8.);
The error message I have points to somewhere else in the code, but I am rather certain that it is those two lines which cause a problem since leaving out the two quoted lines makes the SAS error message vanish.
References
An unofficial SAS how-to on RegEx suggests that I could use standard RegExes.

Why use regex at all?
want = input(scan(have,-2,'._'),32.);

You can use
tmp = prxchange('s/^pre_[A-Za-z0-9]+_([0-9]+)\.xls$/$1/', -1, have);
See the regex demo
Details
s/ - substitution action (we are replacing the match)
^ - start of string
pre_ - a literal prefix
[A-Za-z0-9]+ - one or more alphanumeric ASCII chars (note you may simply use .* here instead if there can be anything)
_ - an underscore
([0-9]+) - Group 1: one or more digits
\.xls$ - .xls at the end of string
$1 - the replacement: the whole match (here, the whole string) is replaced with the contents of Group 1.
As far as the prxchange function is concerned, note that it replaces all occurrences of the pattern once you pass -1 as the times argument, thus, no g flag is necessary.
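To check the pattern outside SAS, here is a minimal Python sketch of the same substitution. Python's re module is only a stand-in for the Perl-style engine behind prxchange (an assumption made for illustration), and the name to_number is hypothetical. Python spells the group reference in the replacement as \1 where SAS uses $1.

```python
import re

# Same pattern as in the prxchange call above.
PATTERN = re.compile(r'^pre_[A-Za-z0-9]+_([0-9]+)\.xls$')

def to_number(filename):
    """Replace the whole name with Group 1, then convert it to an int.

    Returns None when the pattern does not match, because the
    substitution then leaves the input unchanged.
    """
    tmp = PATTERN.sub(r'\1', filename)
    return int(tmp) if re.fullmatch(r'[0-9]+', tmp) else None

print(to_number('pre_31DEC2019_299792458.xls'))  # 299792458
print(to_number('other_file.xls'))               # None
```

Because the pattern is anchored with ^ and $, it can match at most once per string, which is why no "replace all" flag matters here.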

Many ways you could try:
data _null_;
a="pre_31DEC2019_299792458.xls";
b=input(prxchange('s/.*\_(.*)\..*/$1/',-1,a),12.);
c=input(prxchange('s/.*(\d{9}).*/$1/',-1,a),12.);
d=input(prxchange('s/.*(?<=\_)(\d+).*/$1/',-1,a),12.);
put _all_;
run;
.* matches any character, zero or more times; for b, the number you need is between _ and .; for c, it is 9 digits; for d, a lookbehind for "_" finds the digits that follow it.
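The three patterns can be compared side by side in a short Python sketch. Python's re module is used here only as a stand-in for the SAS engine (an assumption), the unnecessary backslash before _ is dropped, and the dictionary names are hypothetical.

```python
import re

name = 'pre_31DEC2019_299792458.xls'

patterns = {
    'b': r'.*_(.*)\..*',      # text between the last "_" and the last "."
    'c': r'.*(\d{9}).*',      # the last run of nine digits
    'd': r'.*(?<=_)(\d+).*',  # digits that directly follow an underscore
}

# Each pattern matches the whole string, so the substitution
# leaves only the captured group behind.
results = {key: int(re.sub(p, r'\1', name)) for key, p in patterns.items()}
print(results)  # {'b': 299792458, 'c': 299792458, 'd': 299792458}
```

All three rely on the leading greedy .* backtracking from the end of the string, so they pick the last candidate rather than the first.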


PostgreSQL: .csv regex - test for repeating substrings within a string (digits)

Introduction:
I have the following scenario in PostgreSQL whereby I want to perform some data validation on a .csv string prior to inserting it into a table (see the fiddle here).
I've managed to get a regex (in a CHECK constraint) which disallows spaces within numbers (e.g. "12 34") and also disallows leading zeros ("00343").
Now, the icing on the cake would be if I could use regular expressions to disallow strings which contain a repeat of an integer - i.e. if a sequence \d+ matched another \d+ within the same string.
Is this beyond the capacities of regular expressions?
My table is as follows:
CREATE TABLE test
(
data TEXT NOT NULL,
CONSTRAINT d_csv_only_ck
CHECK (data ~ '^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')
);
And I can populate it as follows:
INSERT INTO test VALUES
('992,1005,1007,992,456,456,1008'), -- want to make this line unacceptable - repeats!
('44,1005,1110'),
('13, 44 , 1005, 10078 '), -- acceptable - spaces before and after integers
('11,1203,6666'),
('1,11,99,2222'),
('3435'),
(' 1234 '); -- acceptable
But:
INSERT INTO test VALUES ('23432, 3433 ,00343, 567'); -- leading 0 - unacceptable
fails (as it should), and also fails (again, as it should)
INSERT INTO test VALUES ('12 34'); -- spaces within numbers - unacceptable
The question:
However, if you look at the first string, it has repeats of 992 and 456.
I would like to be able to match these.
All of these rules do not have to be in the same regex - I can use a second CHECK constraint.
I would like to know if what I am asking is possible using Regular Expressions?
I did find this post which appears to go some (all?) of the way to solving my issue, but I'm afraid it's beyond my skillset to get it to work - I've included a small test at the bottom of the fiddle.
Please let me know should you require any further information.
p.s. as an aside, I'm not very experienced with regexes and I would welcome any input on my basic one above.
PostgreSQL regex does not allow backreferences inside lookahead constraints, so you cannot apply this restriction there: you would need a negative lookahead with a backreference in it.
Have a look at this PCRE regex:
^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$
See this regex demo.
Details:
^ - start of string
(?!.*\b(\d+)\b.*\b\1\b) - a negative lookahead that fails the match if the same number occurs twice as a whole word anywhere in the string
* - zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
(?:, *[1-9]\d* *)* - zero or more occurrences of
, * - comma and zero or more spaces
[1-9]\d* - a non-zero digit and then any zero or more digits
* - zero or more spaces
$ - end of string.
Even if you replace \b with \y (PostgreSQL regex word boundaries) in the PostgreSQL code, it won't work due to the drawback mentioned at the top of the answer.
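To see the PCRE pattern at work, here is a minimal Python sketch. Python's re module is used only as an example of an engine that accepts a backreference inside a lookahead; it is not a PostgreSQL solution, and the name is_valid is hypothetical.

```python
import re

# The PCRE pattern quoted above, unchanged.
NO_REPEATS = re.compile(
    r'^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$'
)

def is_valid(data):
    """True when data is a comma-separated list of distinct integers
    without leading zeros or inner spaces."""
    return NO_REPEATS.match(data) is not None

print(is_valid('992,1005,1007,992,456,456,1008'))  # False: 992 and 456 repeat
print(is_valid('13, 44 , 1005, 10078 '))           # True
print(is_valid('23432, 3433 ,00343, 567'))         # False: leading zero
print(is_valid('12 34'))                           # False: space inside a number
```

The \b word boundaries are what stop 1 from being treated as a repeat of 11 in a string such as '1,11,99,2222'.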

Regex with 2 semi colons in notepad++

I have data like this
Giftsbirth;;Basket7;CC
Giftswedding;;Cake4;COD
I am trying to find a regex that will only select the second data (Basket7, Cake4).
From past help I tried something like
^(\w+ [^\v;;]+;;[^\v;]+)?.*
But I know that is not right
Please assist with the regex if you can
You could use a positive lookbehind (?<= to assert that what precedes is ;; and a positive lookahead (?= to assert that what follows is ;
Use a negated character class [^;]+ to match one or more characters other than ; for your values.
(?<=;;)[^;]+(?=;)
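A quick way to try the lookaround pattern on the two sample lines is a small Python sketch. Python's re module stands in for the Notepad++ engine here (an assumption), and the variable names are hypothetical.

```python
import re

lines = ['Giftsbirth;;Basket7;CC', 'Giftswedding;;Cake4;COD']

# The lookbehind and lookahead only assert their context; the
# match itself contains just the value between ";;" and ";".
VALUE = re.compile(r'(?<=;;)[^;]+(?=;)')

values = [VALUE.search(line).group() for line in lines]
print(values)  # ['Basket7', 'Cake4']
```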
You may use
(?:.*;)?([^;\n\r]+);[^;\n\r]+$
Or,
.*?;;([^;\r\n]+)(?:;.*)?
and replace with $1.
Details
(?:.*;)? - an optional substring having 0+ chars other than line break chars, as many as possible, up to the ;
([^;\n\r]+) - Group 1: any one or more chars other than CR, LF and ;
; - a semi-colon
[^;\n\r]+ - any one or more chars other than CR, LF and ;
$ - end of line.
The second regex matches
.*?;; - any 0+ chars as few as possible up to (and including) the first ;;
([^;\r\n]+) - Group 1: any one or more chars other than CR, LF and ;
(?:;.*)? - an optional group matching 1 or 0 occurrences of a ; and then any 0+ chars up to the end of line
The $1 in the replacement is the value you need to keep.
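Both replace-based patterns can be checked with a short Python sketch. Python's re module is only a stand-in for the Notepad++ engine (an assumption), it writes the replacement as \1 instead of $1, and the names FIRST and SECOND are hypothetical.

```python
import re

lines = ['Giftsbirth;;Basket7;CC', 'Giftswedding;;Cake4;COD']

# The two patterns from the answer above, unchanged.
FIRST = re.compile(r'(?:.*;)?([^;\n\r]+);[^;\n\r]+$')
SECOND = re.compile(r'.*?;;([^;\r\n]+)(?:;.*)?')

# Each pattern matches the whole line, so replacing the match
# with Group 1 leaves only the wanted value.
first_results = [FIRST.sub(r'\1', line) for line in lines]
second_results = [SECOND.sub(r'\1', line) for line in lines]
print(first_results)   # ['Basket7', 'Cake4']
print(second_results)  # ['Basket7', 'Cake4']
```

The first pattern picks the second-to-last field, while the second picks whatever follows the first ;;, so they would diverge on lines with a different number of columns.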
You need to specify more precisely what "the second data (Basket7, Cake4)" means. This looks like CSV data with the ; set as separator, but that would place Basket7 and Cake4 in the third column, since the second column is empty. In order to write a regex that solves this problem in the general case, you need to take into account the full domain of possible lines, and you've only given two examples and let everyone guess what the underlying format and total possible variations might be.
For example, is it always reasonable to assume that that which you're looking for is always preceded by ;; and ends with a ;, and that ;; never occurs in other places than immediately before that which you're looking for? In that case, (?<=;;)([^;]*) captures this. But what if you encounter one of the following lines?
Giftsbirth;;;CC # Here, the thing matched is empty
Giftsbirth;1600;Basket7;CC # Here, the second column isn't empty
;;Basket7;CC # Here, the first column is empty
;;;CC # Here, all but the last column are empty
;;; # Here, all columns are empty
You may experience that various suggestions will give you "the right text", but if you test this on a limited subset that does not account for all variations that can reasonably be expected in the input, you will inevitably have to revise your regex.
Assuming this is a CSV where the fields don't contain literal ;s, and that you don't know anything about the length of any of the fields (and consequently that the second column isn't always empty), but that there are at least three columns, you could consider the regex:
^[^;]*;[^;]*;([^;]*)
(See demo at https://regex101.com/r/vhPNEj/1)
These assumptions may not be correct, but my ability to guess is much worse than yours, since you're sitting with a larger sample of data. In order to succeed at automating your tasks, it is critical that you learn to modify code to fit your assumptions.
For example, you may want to disregard the cases where the third column is empty:
^[^;]*;[^;]*;([^;]+)
Here the difference is [^;]* changed into [^;]+.
Or you may want to take into account that the first column could contain semicolons when they are wrapped in double quotes, e.g. like "Giftsbirth; Holiday";;Basket7;CC:
^(?:[^;"]*|"[^"]*");[^;]*;([^;]*)
Here the difference is [^;]* changed into (?:[^;"]*|"[^"]*") being either [^;"]* (being all but ; and ") or "[^"]*" (being " followed by anything but ", which includes ;, followed by ").
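The column-based patterns can be exercised against the edge cases listed earlier with a small Python sketch. Python's re module is a stand-in for the Notepad++ engine (an assumption), and the helper name third_column is hypothetical.

```python
import re

# Third column of a ";"-separated line, plain and quote-aware variants.
PLAIN = re.compile(r'^[^;]*;[^;]*;([^;]*)')
QUOTED = re.compile(r'^(?:[^;"]*|"[^"]*");[^;]*;([^;]*)')

def third_column(line, pattern=PLAIN):
    """Return the third field, or None when the line has fewer than three."""
    match = pattern.match(line)
    return match.group(1) if match else None

print(third_column('Giftsbirth;;Basket7;CC'))      # Basket7
print(third_column('Giftsbirth;1600;Basket7;CC'))  # Basket7
print(repr(third_column('Giftsbirth;;;CC')))       # '' (empty third column)
print(third_column('"Giftsbirth; Holiday";;Basket7;CC', QUOTED))  # Basket7
```

Swapping the final [^;]* for [^;]+ would turn the empty-column case into a non-match instead of an empty string.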

Hive REGEXP_EXTRACT returning null results

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zeppelin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC
The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern, but extracts only the first capturing group (index 1) instead of the entire match (index 0).
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to: "find me an uppercase letter followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.
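The difference between the group index and the lookbehind can be seen in a small Python sketch. Python's re module is a stand-in for Hive's Java-based engine (an assumption); Python raw strings need a single backslash where the Hive string literal needs \\d. The variable names are hypothetical.

```python
import re

rows = [
    'Sept Wk 5 Sunny Sailing_R7080075_12345',
    'Holiday_Wk2_Smiles_X1234567_ABC',
]

WITH_GROUP = re.compile(r'_([A-Z]\d{7})')
WITH_LOOKBEHIND = re.compile(r'(?<=_)[A-Z]\d{7}')

# group(0) is the whole match (underscore included);
# group(1) is the first capturing group (the ID only).
whole = [WITH_GROUP.search(row).group(0) for row in rows]
ids = [WITH_GROUP.search(row).group(1) for row in rows]
ids_lookbehind = [WITH_LOOKBEHIND.search(row).group(0) for row in rows]

print(whole)           # ['_R7080075', '_X1234567']
print(ids)             # ['R7080075', 'X1234567']
print(ids_lookbehind)  # ['R7080075', 'X1234567']
```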

RegEx Lookaround issue

I am using PowerShell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match - regex patterns match contiguous chunks only.
Try this (not tested); it should work for any 'my_file_name' length, any number of digits and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by 16 characters of the form _.{8}_.{6} (an underscore, 8 characters, an underscore, 6 characters).
Once you've changed it, it returns that same match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
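The zero-width behaviour and the two suggested fixes can be reproduced in a short Python sketch. Python's re module is a stand-in for the .NET engine behind PowerShell (an assumption), and the variable names are hypothetical.

```python
import re

name = 'my_file_name_01012013_111546.xls'

# The lookahead asserts the date/time part without consuming it.
prefix = re.match(r'.*(?=_.{8}_.{6})', name).group()
# Appending .{3} consumes the 3 characters right after the prefix.
prefix_plus_3 = re.match(r'.*(?=_.{8}_.{6}).{3}', name).group()

# Fix 1: capture name and extension separately and join them.
m = re.match(r'^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$', name)
joined = m.group(1) + m.group(2)

# Fix 2: delete the date/time part by substituting it with nothing.
stripped = re.sub(r'_[0-9]{8}_[0-9]{6}', '', name)

print(prefix)         # my_file_name
print(prefix_plus_3)  # my_file_name_01
print(joined)         # my_file_name.xls
print(stripped)       # my_file_name.xls
```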

Extract numbers between brackets within a string [duplicate]

Possible Duplicate:
Extract info inside all parenthesis in R (regex)
I imported data from Excel and one cell consists of long strings that contain numbers and letters. Is there a way to extract only the numbers from each string and store them in a new variable? Unfortunately, some of the entries have two sets of brackets, and I would only want the second one. Could I use grep for that?
The strings look more or less like this, although their length varies:
"East Kootenay C (5901035) RDA 01011"
or like this:
"Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020"
All I want from this is 5901035 and 5933039
Any hints and help would be greatly appreciated.
There are many possible regular expressions to do this. Here is one:
x=c("East Kootenay C (5901035) RDA 01011","Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020")
> gsub('.+\\(([0-9]+)\\).+?$', '\\1', x)
[1] "5901035" "5933039"
Let's break down the syntax of that first expression '.+\\(([0-9]+)\\).+'
.+ one or more of anything
\\( parentheses are special characters in a regular expression, so if I want to represent the actual thing ( I need to escape it with a \. I have to escape it again for R (hence the two \s).
([0-9]+) I mentioned special characters; here I use two. The first is the parentheses, which indicate a group I want to keep. The second, [ and ], surround a character class (a set of characters to match). See ?regex for more information.
.+?$ The final piece matches the rest of the string up to its end. Combined with the greedy .+ at the start, this ensures that I am grabbing the LAST set of numbers in parens, as noted in the comments.
I could also use * instead of +, which would mean 0 or more rather than one or more, in case your paren string comes at the beginning or end of a string.
The second piece of the gsub is what I am replacing the first portion with. I used: \\1. This says use group 1 (the stuff inside the ( ) from above). I need to escape it twice again, once for the regex and once for R.
Clear as mud to be sure! Enjoy your data munging project!
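For comparison, the same substitution can be sketched in Python. Python's re module is a stand-in for R's regex engine (an assumption); a raw string needs only one backslash where R needs two. The variable names are hypothetical.

```python
import re

x = [
    'East Kootenay C (5901035) RDA 01011',
    'Thompson-Nicola J (Copper Desert Country) (5933039) RDA 02020',
]

# Greedy .+ runs to the end and backtracks, so the LAST
# parenthesised run of digits ends up in Group 1.
LAST_PAREN_NUMBER = re.compile(r'.+\(([0-9]+)\).+?$')

codes = [LAST_PAREN_NUMBER.sub(r'\1', s) for s in x]
print(codes)  # ['5901035', '5933039']
```

"(Copper Desert Country)" is skipped because the group only accepts digits between the parentheses.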
Here is a gsubfn solution:
library(gsubfn)
strapplyc(x, "[(](\\d+)[)]", simplify = TRUE)
[(] matches an open paren, (\\d+) matches a string of digits creating a back-reference owing to the parens around it and finally [)] matches a close paren. The back-reference is returned.