sas pattern matching with square brackets evaluation


I have the following SAS code that checks for patterns and flags any error.
I'm sure that it checks for a pattern in field1, but I'm not sure how two square brackets [] are evaluated.
I need to check for invalid values in field1.
SAS code:
if prxmatch('/^[a-zA-Z][a-zA-Z0-9_]*$/', strip(&vfiel1)) = 0 then do;
    put "Error is field1";
end;

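This regular expression checks for a valid-looking SAS name. Each pair of square brackets is a character class that matches exactly one character from the set it lists. Specifically, the value must start (^) with a letter ([a-zA-Z]), followed by 0 or more (*) letters, digits, and/or underscores ([a-zA-Z0-9_]) before the end ($).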
A better SAS name check would be something along the lines of this:
Libnames: ^[a-zA-Z_][a-zA-Z0-9_]{0,7}$
Dataset & variable names: ^[a-zA-Z_][a-zA-Z0-9_]{0,31}$
Note these allow names to start with an underscore and have max lengths of 8 and 32 characters.
Here is a page on Names in the SAS Language.
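As a rough illustration of how the three patterns above differ, here is a minimal sketch in Python's re module rather than SAS PRX (the character-class syntax is the same in both). The constant names and the is_valid helper are made up for this example.

```python
import re

# Patterns copied from the answer above; the constant names are illustrative.
ORIGINAL = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]*$")
LIBREF = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,7}$")
NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,31}$")


def is_valid(pattern, value):
    """Return True when the stripped value matches the whole pattern."""
    return pattern.match(value.strip()) is not None


for candidate in ["field1", "_tmp", "1abc", "bad-name"]:
    print(candidate, is_valid(ORIGINAL, candidate), is_valid(NAME, candidate))
```

The first bracket group decides which characters may start the name, and the second decides which characters may follow; only the second pattern family accepts a leading underscore and caps the length.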

Related

Regex lookaround does not work with quantifiers in SAS

I have a table similar to this:
Data have;
text = 'insurance premium'; output;
text = 'insur. premium'; output;
text = 'premium. insur aa'; output;
text = 'premium card'; output;
text = 'sales premium'; output;
Run;
My task is to select all transactions that contain the word premium, but do not contain the word insurance or a form thereof (e.g. insur, ins. etc.). I read up on how to use lookaround expressions in regex and wrote the following expression:
/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i
The expression seems to work on testing websites such as https://regexr.com/, but when I run the code below I get an error in SAS:
Data want;
Set have;
re = prxparse('/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s)/i');
flg = prxmatch(re, text) > 0;
Run;
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!insur[a-z.]*\s)premium(?!.*insur[a-z.]*\s) << HERE /.
ERROR: Variable length lookbehind not implemented before HERE mark in regex m/(?<!ins[a-z.]*\s)premium(?!.*ins[a-z.]*\s) << HERE /.
ERROR: The regular expression passed to the function PRXPARSE contains a syntax error.
NOTE: Argument 1 to function PRXPARSE('/(?<!ins[a-z'[12 of 45 characters shown]) at line 30 column 6 is invalid.
NOTE: Argument 1 to the function PRXMATCH is missing.
As far as I understood there is an issue with the * symbols inside the lookaround functions, because the error does not occur if I remove them. Does SAS implement such expressions differently or does it simply not support such expressions?

You are using flg = prxmatch(re, text) > 0; to see whether there is a match by checking that the returned position is greater than 0.
You can put the negative lookahead at the start of the string to check for the variations of insurance, and then match the word premium.
^(?!.*\bins[a-z.]*\s).*\bpremium\b
Explanation
^ Start of string
(?! Negative lookahead, assert that on the right is not
.*\bins Match a word starting with ins
[a-z.]*\s Optionally repeat matching chars a-z or . followed by a whitespace char
) Close the lookahead
.*\bpremium\b match the word premium in the line
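To see the anchored negative lookahead at work on the sample rows from the question, here is a small sketch using Python's re module as a stand-in for PRXMATCH. The variable names are illustrative, and re.IGNORECASE plays the role of the /i flag.

```python
import re

# Same pattern as above; re.IGNORECASE stands in for the /i modifier.
PATTERN = re.compile(r"^(?!.*\bins[a-z.]*\s).*\bpremium\b", re.IGNORECASE)

rows = [
    "insurance premium",
    "insur. premium",
    "premium. insur aa",
    "premium card",
    "sales premium",
]

# 1 when the row mentions premium without an "ins..." word, else 0.
flags = [1 if PATTERN.search(text) else 0 for text in rows]
print(flags)  # [0, 0, 0, 1, 1]
```

Because the lookahead sits at the start of the string, it scans the whole value once, so no variable-length lookbehind is needed.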

You cannot use a lookbehind with a variable width pattern in a PCRE regex. However, you can match and skip substrings you do not need using (*SKIP)(*FAIL) verbs, so you can revamp the regex you have in the following way:
prxparse('/ins[a-z.]*\spremium(?!.*ins[a-z.]*\s)(*SKIP)(*F)|premium(?!.*ins[a-z.]*\s)/i')
Mind that patterns are parsed and searched for from left to right. The first alternative, ins[a-z.]*\spremium(?!.*ins[a-z.]*\s)(*SKIP)(*F)|, is tried first: if ins[a-z.]*\spremium(?!.*ins[a-z.]*\s) is found, that text is skipped and the match fails at that position. Otherwise, the second alternative, premium(?!.*ins[a-z.]*\s), comes into play and matches premium in any other context, as long as it is not followed later in the string by ins, zero or more letters or dots, and a whitespace.

Converting a normal regular RegEx to one in SAS

Despite reading the SAS documentation and various example pages, I am struggling to convert a slightly more complicated regex to SAS syntax. I am using the prxchange function. This is what I have come up with so far to convert a filename string like pre_31DEC2019_299792458.xls to an integer number (of length 8) 299792458 inside a SAS data step:
tmp=prxchange('s/pre_([a-zA-Z0-9]{8,9})_([0-9]{1,16})\.xls/\2/g',-1,have);
want=input(tmp,8.);
The error message I have points to somewhere else in the code, but I am rather certain that it is those two lines which cause a problem since leaving out the two quoted lines makes the SAS error message vanish.
References
An unofficial SAS how-to on regex suggests that I could use standard regexes.

Why use regex at all?
want = input(scan(have,-2,'._'),32.);
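The same idea, taking the second-to-last token after splitting on . and _, can be sketched without a matching regex in Python. The function name is hypothetical; the sketch only mirrors how scan treats runs of delimiters as a single separator.

```python
import re


def second_to_last_token(value):
    """Split on runs of '.' or '_' and return the second-to-last token."""
    tokens = [t for t in re.split(r"[._]+", value) if t]
    return tokens[-2]


have = "pre_31DEC2019_299792458.xls"
want = int(second_to_last_token(have))
print(want)  # 299792458
```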

You can use
tmp = prxchange('s/^pre_[A-Za-z0-9]+_([0-9]+)\.xls$/$1/', -1, have);
Details
s/ - substitution action (we are replacing the match)
^ - start of string
pre_ - a literal prefix
[A-Za-z0-9]+ - one or more alphanumeric ASCII chars (note you may simply use .* here instead if there can be anything)
_ - an underscore
([0-9]+) - Group 1: one or more digits
\.xls$ - .xls at the end of string
$1 - the replacement: the whole match (here, the whole string) is replaced with the contents of Group 1.
As far as the prxchange function is concerned, note that it replaces all occurrences of the pattern once you pass -1 as the times argument, thus, no g flag is necessary.
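For comparison, here is a minimal Python sketch of the same anchored substitution. Python writes the backreference as \1 rather than $1, and the extract_number helper is an illustrative name, not part of any SAS API.

```python
import re

# Same pattern as in the prxchange call above.
PATTERN = re.compile(r"^pre_[A-Za-z0-9]+_([0-9]+)\.xls$")


def extract_number(filename):
    """Replace the whole filename with Group 1 and convert it to int."""
    replaced = PATTERN.sub(r"\1", filename)
    # re.sub returns the input unchanged when nothing matches,
    # so guard before converting.
    return int(replaced) if replaced.isdigit() else None


print(extract_number("pre_31DEC2019_299792458.xls"))  # 299792458
```

Anchoring with ^ and $ means the whole string is consumed by the match, so only the captured digits remain after the replacement.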

Many ways you could try:
data _null_;
a="pre_31DEC2019_299792458.xls";
b=input(prxchange('s/.*\_(.*)\..*/$1/',-1,a),12.);
c=input(prxchange('s/.*(\d{9}).*/$1/',-1,a),12.);
d=input(prxchange('s/.*(?<=\_)(\d+).*/$1/',-1,a),12.);
put _all_;
run;
.* matches any character, zero or more times. For b, the number you need sits between the last _ and the following .; for c, it is a run of nine digits; for d, a lookbehind for "_" locates the digits.

character 0: character set expected

I want to define a table name by regular expression defined here such that:
Always begin a name with a letter, an underscore character (_), or a
backslash (\). Use letters, numbers, periods, and underscore
characters for the rest of the name.
Exceptions: You can’t use "C", "c", "R", or "r" for the name, because
they’re already designated as a shortcut for selecting the column or
row for the active cell when you enter them in the Name or Go To box.
let lex_valid_characters_0 = ['a'-'z' 'A'-'Z' '_' '\x5C'] ['a'-'z' 'A'-'Z' '0'-'9' '.' '_']+
let haha = ['C' 'c' 'R' 'r']
let lex_table_name = lex_valid_characters_0 # haha
But it gives me the error character 0: character set expected. Could anyone help?

Here is the description of # from the manual:
regexp1 # regexp2
(difference of character sets) Regular expressions regexp1 and regexp2 must be character sets defined with [… ] (or a single character expression or underscore _). Match the difference of the two specified character sets.
The description says the two sets must be character sets defined with [ ... ] but your definition of lex_valid_characters_0 is far more complex than that.
The idea of # is that it defines a pattern that matches exactly one character from a set specified as the difference of two one-character patterns. So it doesn't make sense to apply it to lex_valid_characters_0, which matches strings of arbitrary length.
Update
Here is my thinking on the problem, for what it's worth. There are no extra restrictions on names that are 2 or more characters long (as I read the spec). So it shouldn't be too difficult to specify a regular expression for these names. And it also wouldn't be that hard to come up with a regular expression that defines all the valid 1-character names. The full set of names is the union of these two sets.
You could also use the fact that the longest, first match is the one that applies for ocamllex. I.e., you could have rules for the 4 special cases before the general rule.
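The union idea can be sketched as follows. This is written with Python's re module rather than ocamllex, purely to illustrate the shape of the two sets; all names are hypothetical.

```python
import re

# Valid 1-character names: a letter other than C, c, R, r, or _ or backslash.
ONE_CHAR = r"[abd-qs-zABD-QS-Z_\\]"
# Names of 2 or more characters: no extra restriction beyond the general rule.
LONGER = r"[a-zA-Z_\\][a-zA-Z0-9._]+"
# The full set of names is the union of the two sets.
TABLE_NAME = re.compile(r"(?:" + LONGER + r"|" + ONE_CHAR + r")")


def is_table_name(text):
    """Return True when the whole text is a valid table name."""
    return TABLE_NAME.fullmatch(text) is not None


for name in ["Table1", "C", "r", "x", "Cc", "1abc"]:
    print(name, is_table_name(name))
```

In ocamllex, the one-character set is where the # operator would fit, since both of its operands are then plain character sets.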

Regex with 2 semi colons in notepad++

I have data like this
Giftsbirth;;Basket7;CC
Giftswedding;;Cake4;COD
I am trying to find a regex that will only select the second data (Basket7, Cake4).
From past help I tried something like
^(\w+ [^\v;;]+;;[^\v;]+)?.*
But I know that is not right
Please assist with the regex if you can

You could use a positive lookbehind (?<= to assert that what comes before is ;; and a positive lookahead (?= to assert that what follows is ;
Use a negated character class [^;]+ to match one or more characters other than ; which are your values.
(?<=;;)[^;]+(?=;)
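A quick way to check the lookaround pattern against the two sample lines is the following Python sketch. Notepad++ uses a different regex engine, but this pattern behaves the same way here; the variable names are illustrative.

```python
import re

data = "Giftsbirth;;Basket7;CC\nGiftswedding;;Cake4;COD"

# The lookarounds assert the surrounding semicolons without consuming them.
values = re.findall(r"(?<=;;)[^;]+(?=;)", data)
print(values)  # ['Basket7', 'Cake4']
```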

You may use
(?:.*;)?([^;\n\r]+);[^;\n\r]+$
Or,
.*?;;([^;\r\n]+)(?:;.*)?
and replace with $1.
Details
(?:.*;)? - an optional substring having 0+ chars other than line break chars, as many as possible, up to the ;
([^;\n\r]+) - Group 1: any one or more chars other than CR, LF and ;
; - a semi-colon
[^;\n\r]+ - any one or more chars other than CR, LF and ;
$ - end of line.
The second regex matches
.*?;; - any 0+ chars as few as possible up to (and including) the first ;;
([^;\r\n]+) - Group 1: any one or more chars other than CR, LF and ;
(?:;.*)? - an optional group matching 1 or 0 occurrences of a ; and then any 0+ chars up to the end of line
The $1 in the replacement is the value you need to keep.
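The replace-with-Group-1 approach can be sketched in Python as well, where the replacement is written \1 instead of $1. This only illustrates the second regex above on the two sample lines; the names are made up.

```python
import re

# Second regex from the answer above.
PATTERN = re.compile(r".*?;;([^;\r\n]+)(?:;.*)?")

lines = ["Giftsbirth;;Basket7;CC", "Giftswedding;;Cake4;COD"]

# Each whole line is matched and replaced with the captured field.
kept = [PATTERN.sub(r"\1", line) for line in lines]
print(kept)  # ['Basket7', 'Cake4']
```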

You need to specify more precisely what "the second data (Basket7, Cake4)" means. This looks like CSV data with the ; set as separator, but that would place Basket7 and Cake4 in the third column, since the second column is empty. In order to write a regex that solves this problem in the general case, you need to take into account the full domain of possible lines, and you've only given two examples and let everyone guess what the underlying format and total possible variations might be.
For example, is it always reasonable to assume that that which you're looking for is always preceded by ;; and ends with a ;, and that ;; never occurs in other places than immediately before that which you're looking for? In that case, (?<=;;)([^;]*) captures this. But what if you encounter one of the following lines?
Giftsbirth;;;CC # Here, the thing matched is empty
Giftsbirth;1600;Basket7;CC # Here, the second column isn't empty
;;Basket7;CC # Here, the first column is empty
;;;CC # Here, all but the last column are empty
;;; # Here, all columns are empty
You may experience that various suggestions will give you "the right text", but if you test this on a limited subset that does not account for all variations that can reasonably be expected in the input, you will inevitably have to revise your regex.
Assuming this is a CSV where the fields don't contain literal ;s, and that you don't know anything about the length of any of the fields (and consequently that the second column isn't always empty), but that there are at least three columns, you could consider the regex:
^[^;]*;[^;]*;([^;]*)
(See demo at https://regex101.com/r/vhPNEj/1)
These assumptions may not be correct, but my ability to guess is much worse than yours, since you're sitting with a larger sample of the data. In order to succeed at automating your tasks, it is critical that you learn to modify code to fit your assumptions.
For example, you may want to disregard the cases where the third column is empty:
^[^;]*;[^;]*;([^;]+)
Here the difference is [^;]* changed into [^;]+.
Or you may want to take into account that the first column could contain semicolons when they are wrapped in double quotes, e.g. like "Giftsbirth; Holiday";;Basket7;CC:
^(?:[^;"]*|"[^"]*");[^;]*;([^;]*)
Here the difference is [^;]* changed into (?:[^;"]*|"[^"]*") being either [^;"]* (being all but ; and ") or "[^"]*" (being " followed by anything but ", which includes ;, followed by ").
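To compare how these variants behave on the edge cases listed earlier, here is a hedged Python sketch. The helper third_column and the constant names are illustrative, and the lines are treated as individual strings rather than as a whole file.

```python
import re

THIRD = re.compile(r"^[^;]*;[^;]*;([^;]*)")
# Variant that tolerates semicolons inside a double-quoted first column.
THIRD_QUOTED = re.compile(r'^(?:[^;"]*|"[^"]*");[^;]*;([^;]*)')


def third_column(line, pattern=THIRD):
    """Return the captured third column, or None if there are fewer than three."""
    match = pattern.match(line)
    return match.group(1) if match else None


samples = [
    "Giftsbirth;;Basket7;CC",
    "Giftsbirth;;;CC",
    "Giftsbirth;1600;Basket7;CC",
    ";;Basket7;CC",
    ";;;CC",
]
for line in samples:
    print(repr(line), "->", repr(third_column(line)))

quoted = '"Giftsbirth; Holiday";;Basket7;CC'
print(repr(third_column(quoted, THIRD_QUOTED)))
```

Running both patterns over the quoted line shows why the extra alternative matters: the plain pattern splits the quoted field at its inner semicolon and captures an empty third column.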

Hive REGEXP_EXTRACT returning null results

I am trying to extract R7080075 and X1234567 from the sample data below. The format is always a single upper case character followed by 7 digit number. This ID is also always preceded by an underscore. Since it's user generated data, sometimes it's the first underscore in the record and sometimes all preceding spaces have been replaced with underscores.
I'm querying HDP Hive with this in the select statement:
REGEXP_EXTRACT(column_name,'[(?:(^_A-Z))](\d{7})',0)
I've tried addressing positions 0-2 and none return an error or any data. I tested the code on regextester.com and it highlighted the data I want to extract. When I then run it in Zeppelin, it returns NULLs.
My regex experience is limited so I have reviewed the articles here on regexp_extract (+hive) and talked with a colleague. Thanks in advance for your help.
Sample data:
Sept Wk 5 Sunny Sailing_R7080075_12345
Holiday_Wk2_Smiles_X1234567_ABC

The Hive manual says this:
Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Also, your expression includes unnecessary characters in the character class.
Try this:
REGEXP_EXTRACT(column_name,'_[A-Z](\\d{7})',0)
Since you want only the part without underscore, use this:
REGEXP_EXTRACT(column_name,'_([A-Z]\\d{7})',1)
It matches the entire pattern, but extracts only the first capture group (index 1) instead of the entire match (index 0).
Or alternatively:
REGEXP_EXTRACT(column_name,'(?<=_)[A-Z]\\d{7}', 0)
This uses a regexp technique called "positive lookbehind". It translates to: "find me an uppercase letter followed by 7 digits, but only if they are preceded by an _". It uses the _ for matching but doesn't consider it part of the extracted match.
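The difference between group index 0, group index 1, and the lookbehind variant can be checked with this small Python sketch. The regexp_extract function here is only a rough stand-in for the Hive function of the same name. Python raw strings need a single backslash, whereas the doubled \\d above is specific to Hive string literals.

```python
import re

samples = [
    "Sept Wk 5 Sunny Sailing_R7080075_12345",
    "Holiday_Wk2_Smiles_X1234567_ABC",
]


def regexp_extract(subject, pattern, index):
    """Rough stand-in: return the group at `index`, or None when nothing matches.

    What Hive itself returns on a failed match is not modeled here.
    """
    match = re.search(pattern, subject)
    return match.group(index) if match else None


for text in samples:
    print(
        regexp_extract(text, r"_[A-Z](\d{7})", 0),     # whole match, with the _
        regexp_extract(text, r"_([A-Z]\d{7})", 1),     # first capture group only
        regexp_extract(text, r"(?<=_)[A-Z]\d{7}", 0),  # lookbehind, no group needed
    )
```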
