SAS String comparison - sas

I'm transitioning from SQL Server to SAS.
In SQL Server we could get away with string comparisons where 'abc ' = 'aBc' would be true.
In SAS so far I've had to STRIP and UPPER every string on every comparison.
Is there an option that can be set to allow 'abc ' = 'aBc' to be true?
My Google-Fu has failed me.

I believe you are looking for the compare function with the 'i' modifier (for ignore case). When this returns a 0 there's a match.
(See p. 70 in here: http://support.sas.com/publishing/pubcat/chaps/59343.pdf)
data a;
input string1 $ string2 $;
datalines;
abc aBc
cba CBA
AbC ABC
AC AbC
BCA CAb
;
run;
data b;
set a;
c = compare(string1,string2);
d = compare(string1,string2,'i');
run;
proc print noobs;
where d = 0;
var string1 string2;
run;
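As a side note, a plain = comparison in SAS already ignores trailing blanks, because the shorter value is padded with blanks before the comparison, so only the case difference needs handling. A minimal sketch (the variable names here are made up for illustration):

```sas
data _null_;
  a = 'abc ';
  b = 'aBc';
  /* = pads the shorter operand with blanks, so only case matters here */
  plain  = (a = b);                  /* 0: case differs */
  nocase = (upcase(a) = upcase(b));  /* 1: the trailing blank is ignored */
  put plain= nocase=;
run;
```

Leading blanks are still significant, so STRIP or LEFT is only needed when values may be indented.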

You can try the PRX functions which use Perl Regular Expressions.
'/abc/i' will match anything with the string 'abc' in any case (because of the 'i' after the closing /)
Using PRXMATCH as an example:
prxmatch('/abc/i', 'aBc')
This returns 1, the position at which the match starts.
More on regular expressions: https://www.cs.tut.fi/~jkorpela/perl/regexp.html
PRX in SAS:
http://documentation.sas.com/?docsetId=lefunctionsref&docsetVersion=3.1&docsetTarget=n0bj9p4401w3n9n1gmv6tfshit9m.htm&locale=en
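Keep in mind that '/abc/i' matches a substring anywhere in the value. If the goal is whole-value equality, the pattern probably needs anchors. A rough sketch, with illustrative variable names:

```sas
data _null_;
  length s $8;
  s = 'aBc';
  /* ^ and $ anchor the match; \s* allows the trailing blanks SAS pads with */
  whole = prxmatch('/^abc\s*$/i', s);      /* 1 */
  /* without anchors, a longer value containing abc also matches */
  part  = prxmatch('/abc/i', 'xxABCxx');   /* 3 */
  put whole= part=;
run;
```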

Related

SAS mainframe replace 20..99 to 200099

In my SAS mainframe code, how do I replace . with 0?
data newlic;
INPUT #1 LICNO $10.;
DATALINES;
203....412
...3300421
9955..032.
;
RUN;
PROC PRINT DATA = NEWLIC;
RUN;
DATA MYDATA;
SET NEWLIC;
ARRAY A(*) _NUMERIC_;
DO I=1 TO DIM(A);
IF A(I) = . THEN A(I) = 0;
END;
DROP I;
RUN;
PROC PRINT DATA = MYDATA;
RUN;
My required output:
2030000412
0003300421
9955000320
The requirement is to replace '.' with 0.
Use a regular expression to replace all non-alphanumeric characters with a 0:
s/[^0-9a-zA-Z]/0/
You can implement regex replacements in SAS with prxchange().
data mydata;
set newlic;
licno = prxchange('s/[^0-9a-zA-Z]/0/', -1, licno);
run;
You can use the TRANSLATE() function to replace unwanted characters with '0'. The COMPRESS() function with the d modifier removes the digits from the value, which leaves exactly the non-digit characters that need replacing.
fixed=translate(licno,repeat('0',255),compress(licno,,'d'));
Results:
Obs LICNO fixed
1 1234567890 1234567890
2 ABC 9 0000000009
3 203....412 2030000412
4 ...3300421 0003300421
5 9955..032. 9955000320
6 123 1230000000
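If the only unwanted character is the period, as in the question's sample data, a simpler TRANSLATE call may be enough. This is a sketch that assumes the newlic dataset from the question and leaves blanks and letters untouched:

```sas
data mydata;
  set newlic;
  /* TRANSLATE(source, to, from): every '.' becomes '0' */
  licno = translate(licno, '0', '.');
run;
```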
You can use the regular expression metacharacter \D to locate non-digit characters and replace them with 0 using PRXCHANGE().
From the complete list in the documentation:
\d matches a digit character; it is equivalent to [0-9].
\D matches any character that is not a digit.
Example:
data have;
input licno $char10.;
datalines;
1234567890
ABC 9
203....412
...3300421
9955..032.
123
;
data want;
set have;
fixed = prxchange('s/\D/0/', -1, licno);
run;

SAS Retain not working for 1 string variable

The code below doesn't seem to work for the variable all_s when there is more than one record with the same urn. var1, var2 and var3 work fine, but all_s doesn't, and I can't figure out why. For the first record of a urn (first.urn), I want all_s to equal single_var1, single_var2 and single_var3 concatenated with no spaces, but I want it to be
all_s = all_s + ',' + single_var1 + single_var2 + single_var3
when it's not the first instance of that urn.
data dataset_2;
set dataset_1;
by URN;
retain count var1 var2 var3 all_s;
format var1 $40. var2 $40. var3 $40. all_s $50.;
if first.urn then do;
count=0;
var1 = ' ';
var2 = ' ';
var3 = ' ';
all_s = ' ';
end;
var1 = catx(',',var1,single_var1);
var2 = catx(',',var2,single_var2);
var3 = catx(',',var3,single_var3);
all_s = cat(all_s,',',single_var1,single_var2,single_var3);
count = count+1;
if first.urn then do;
all_s = cat(single_var1,single_var2,single_var3);
end;
run;
all_s is not large enough to hold the concatenation if the total length of the var1-var3 values within the group exceeds 50 characters. That seems likely with var1-var3 each being $40.
I recommend using the LENGTH statement to specify variable lengths. A FORMAT statement only creates a variable of a certain length as a side effect.
catx removes blank arguments from the concatenation, so if you want spaces in the concatenation when a single_varN is blank, you won't be able to use catx.
A requirement that non-blank values be stripped and blank values be kept as a single blank will likely have to fall back to the old-school trim(left(… approach.
Sample code
data have;
length group 8 v1-v3 $5;
input group (v1-v3) (&);
datalines;
1 111 222 333
1 . 444 555
1 . . 666
1 . . .
1 777 888 999
2 . . .
2 . b c
2 x . z
run;
data want(keep=group vlist: all_list);
length group 8 vlist1-vlist3 $40 all_list $50;
length comma1-comma3 comma $2;
do until (last.group);
set have;
by group;
vlist1 = trim(vlist1)||trim(comma1)||trim(left(v1));
vlist2 = trim(vlist2)||trim(comma2)||trim(left(v2));
vlist3 = trim(vlist3)||trim(comma3)||trim(left(v3));
comma1 = ifc(missing(v1), ' ,', ',');
comma2 = ifc(missing(v2), ' ,', ',');
comma3 = ifc(missing(v3), ' ,', ',');
all_list =
trim(all_list)
|| trim(comma)
|| trim(left(v1))
|| ','
|| trim(left(v2))
|| ','
|| trim(left(v3))
;
comma = ifc(missing(v3),' ,',',');
end;
run;
Reference
SAS has operators and multiple functions for string concatenation
|| concatenate
cat concatenate
catt concatenate, trimming (remove trailing spaces) of each argument
cats concatenate, stripping (remove leading and trailing spaces) of each argument
catx concatenate, stripping each argument and delimiting
catq concatenate with delimiter and quote arguments containing the delimiter
From SAS 9.2 documentation
Comparisons
The results of the CAT, CATS, CATT, and CATX functions are usually equivalent to results that are produced by certain combinations of the concatenation operator (||) and the TRIM and LEFT functions. However, the default length for the CAT, CATS, CATT, and CATX functions is different from the length that is obtained when you use the concatenation operator. For more information, see Length of Returned Variable.
Note: In the case of variables that have missing values, the concatenation produces different results. See Concatenating Strings That Have Missing Values.
Some example data would be helpful, but I'm going to give it a shot and ask you to try
all_s = cat(strip(All_s),',',single_var1,single_var2,single_var3);
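The reason STRIP may help: CAT does not remove the blanks that pad a character variable out to its declared length, so the padding of all_s ends up in the middle of the result and the result no longer fits in all_s. A small sketch with made-up names and a shorter length shows the difference:

```sas
data _null_;
  length s $10;
  s = 'ab';
  /* CAT keeps the 8 trailing blanks that pad s to its length of 10 */
  padded   = cat(s, ',', 'cd');
  /* STRIP removes the padding first, giving ab,cd */
  stripped = cat(strip(s), ',', 'cd');
  put padded= stripped=;
run;
```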

Get string between two specific char positions

I have a long text string in SAS. It contains a value of variable length that is always preceded by a '#' and ends with ' ,'.
Is there a way I can extract this and store it as a new variable, please?
e.g:
word word, word, #12.34, word, word
And I want to get the 12.34
Thanks!
Double scan should also work if you only have a single #:
data _null_;
var1 = 'word word, word, #12.34, word, word';
var2 = scan(scan(var1,2,'#'),1,',');
put var2=;
run;
You can make use of the substr and index functions to do this. The index function returns the first position of the character specified.
data _null_;
var1 = 'word word, word, #12.34, word, word';
pos1 = index(var1,'#'); *Get the position of the first # sign;
tmp = substr(var1,pos1+1); *Create a string that returns only characters after the # sign;
put tmp;
pos2 = index(tmp,','); *Get the position of the first "," in the tmp variable;
var2 = substr(tmp,1,pos2-1);
put var2;
run;
Note that this method only works if there is only one "#" in the string.
One way is to use index to locate the two 'sentinels' delimiting the value and retrieve the innards with substr. If the value is supposed to be numeric, an additional call to the input function is needed.
A second way is to use the regular expression routines prxmatch and prxposn to locate and extract the embedded value.
data have;
input;
longtext = _infile_;
datalines;
some thing #12.34, wicked
#, oops
#5a64, oops
# oops
oops ,
oops #
ok #1234,
who wants be a #1e6,aire
space # , the final frontier
double #12, jeopardy #34, alex
run;
data want;
set have;
* locate with index;
_p1 = index(longtext,'#');
if _p1 then _p2 = index(substr(longtext,_p1),',');
if _p2 > 2 then num_in_text = input (substr(longtext,_p1+1,_p2-2), ?? best.);
* locate with regular expression;
if _n_ = 1 then _rx = prxparse('/#(\d*\.?\d*)?,/'); retain _rx;
if prxmatch(_rx,longtext) then do;
call prxposn(_rx,1,_start,_length);
if _length > 0 then num_in_text_2 = input (substr(longtext,_start, _length), ?? best.);
end;
* drop _: ;
run;
The regex approach looks for ##.## variants, while the index approach looks only for #...,. The input function will then decipher scientific notation values (such as 1e6) that the example regex pattern will not locate. The ?? modifier in the input function suppresses the invalid-argument NOTEs in the log when the enclosed value cannot be parsed as a number.
Another way is to use a regular expression; the code is given below.
data have;
infile datalines truncover ;
input var $200.;
datalines;
word word, word, #12.34, word, word
word1 #12.34, hello hi hello hi
word1 #970000 hello hi hello hi #970022, hi
word1 123, hello hi hello hi #97.99
#99456, this is cool
;
A short note on the regular expression and functions used below:
(?<=#) is a zero-width positive look-behind assertion that requires a # immediately before the pattern of interest.
(\d+\.?\d+) matches one or more digits, an optional literal ., and then one or more digits, so the value must contain at least two digits.
(?=,) is a zero-width positive look-ahead assertion that requires a , immediately after the pattern of interest.
call prxsubstr finds the position and length of the match, and substr extracts the required value.
data want( drop=pattern position length);
retain pattern;
IF _N_ = 1 THEN PATTERN = PRXPARSE("/(?<=#)(\d+\.?\d+)(?=,)/");
set have;
call prxsubstr(pattern, var, position, length);
if position then
match = substr(var, position, length);
run;
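A capture group is another option, and it avoids the look-around assertions. The sketch below assumes the have dataset and var variable from above; want2, match and rx are illustrative names, and the pattern is one possible choice that also accepts single-digit values:

```sas
data want2;
  set have;
  length match $20;
  retain rx;
  /* group 1: digits with an optional fractional part, between # and , */
  if _n_ = 1 then rx = prxparse('/#(\d+(?:\.\d+)?),/');
  /* PRXPOSN returns the text captured by group 1 of the last match */
  if prxmatch(rx, var) then match = prxposn(rx, 1, var);
  drop rx;
run;
```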
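If you want to get really lazy, you can just do
want = compress(have,".","kd");
This keeps only digits and periods, so it is reliable only when the rest of the text contains neither.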

Regular Expression dot in SAS

I'm new to this field and am trying to use prxmatch and rxmatch to match some strings.
The pattern is a., which should match a string of at least 2 characters in which a isn't the last one.
I run prxmatch('/a./', 'a') and rxmatch('/a./', 'a'); the result should be 0, but the system returns 1.
So how can I get 0 in this case?
If you write an MCVE for this, you do get no match.
data test;
x='a';
rc=prxmatch('~a.~',x);
put x= rc=;
run;
However, if x is not length 1, it will match!
data test;
length x $5;
x='a';
rc=prxmatch('~a.~',x);
put x= rc=;
run;
Why?
Because in SAS, strings are not varchar, they are char. They have spaces padding the rest of the string out to its full length. So you would need to do either
data test;
length x $5;
x='a';
rc=prxmatch('~a[^ ]~',x);
put x= rc=;
run;
or, better,
data test;
length x $5;
x='a';
rc=prxmatch('~a.~',trim(x));
put x= rc=;
run;
(Note, I use ~ for my regex delimiter - you're free to use slash, or any other character, for that, it makes no difference.)
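A third option, sketched here as an assumption about what you want, is \S, which matches any non-whitespace character. That keeps the padding blanks from satisfying the second position without having to trim:

```sas
data test;
  length x $5;
  do x = 'a', 'ab';
    /* \S cannot match the blanks that pad x out to 5 characters */
    rc = prxmatch('~a\S~', x);
    put x= rc=;   /* rc=0 for 'a', rc=1 for 'ab' */
  end;
run;
```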

How to pad out character fields in SAS?

I am creating a SAS dataset from a database that includes a VARCHAR(5) key field.
This field includes some entries that use all 5 characters and some that use fewer.
When I import this data, I would prefer to pad all the shorter entries out to use all five characters. For this example, I want to pad on the left with 0, the character zero. So, 114 would become 00114, ABCD would become 0ABCD, and EA222 would stay as it is.
I've attempted this with a simple data statement, but of course the following does not work:
data test;
set databaseinput;
format key $5.;
run;
I've tried to do this with a user-defined informat, but I don't think it's possible to specify the ranges correctly on character fields, per this SAS KB answer. Plus, I'm fairly sure proc format won't let me define the result dynamically in terms of the incoming variable.
I'm sure there's an obvious solution here, but I'm just missing it.
Here is an alternative:
data padded_data_dsn;
length key $5;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = translate(right(raw_data),'0',' ');
run;
data raw_data_dsn;
length key key1 $5;
do key = '4', 'A114', 'A1140';
/* REPEAT('0', n) returns n+1 zeros, so pass 4 - length(key) and leave full-length keys alone */
if length(key) < 5 then key1 = catt(repeat('0', 4 - length(key)), key);
else key1 = key;
output;
end;
run;
I'm sure someone will have a more elegant solution, but the following code works. Essentially it is padding the variable with five leading zeros, then reversing the order of this text string so that the zeros are to the right, then reversing this text string again and limiting the size to five characters, in the original order but left-padded with zeros.
data raw_data_dsn;
format key $varying5.;
key = '114'; output;
key = 'ABCD'; output;
key = 'EA222'; output;
run;
data padded_data_dsn;
format key $5.;
drop raw_data;
set raw_data_dsn(rename=(key=raw_data));
key = put(put('00000' || raw_data ,$revers10.),$revers5.);
run;
Here's what worked for me.
data b (keep = str2);
format str2 $5. ;
set a;
catlength = 4 - length(str);
cat = repeat('0', catlength);
str2 = catt(cat, str);
run;
It works by measuring the length of the existing string, building a string of zeros to fill the gap, and then concatenating that padding with the original string. The count passed to REPEAT is 4 rather than 5 because REPEAT('0', n) returns n+1 copies of '0'.
Notice that it screws up if the original string is length 5.
Also - it won't work if the input string has a $5. format on it.
data a; /*input dataset*/
input str $;
datalines;
a
aa
aaa
aaaa
aaaaa
;
run;
data b (keep = str2);
format str2 $5. ;
set a;
catlength = 4 - length(str);
cat = repeat('0', catlength);
str2 = catt(cat, str);
run;
input:
a
aa
aaa
aaaa
aaaaa
output:
0000a
000aa
00aaa
0aaaa
0aaaa
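One way to handle the length-5 case might be to prepend five zeros and keep only the last five characters. This is a sketch, assuming the same input dataset a and variable str as above:

```sas
data b (keep = str2);
  length str2 $5;
  set a;
  /* '00000' plus str has 5 + length(str) characters; */
  /* starting at length(str) + 1 leaves exactly the last five */
  str2 = substr(cats('00000', str), length(str) + 1);
run;
```

With the sample input this should give 0000a, 000aa, 00aaa, 0aaaa and aaaaa.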
I use this, but it only works with numeric values :S. Try other informats in the INPUT function.
data work.prueba;
format xx $5.;
xx='1234';
vv=PUT(INPUT(xx,best5.),z5.);
run;