SAS lowercase SHA256 string - sas

Currently testing a SHA256 function in order to prepare a variable for use in another application.
The user has requested the SHA256 result be in lower case. I created a quick record in order to make sure I can convert the string-
data have;
input first $ last $ dob $ 10. sex $;
cards;
test person 1955-07-31 1
;
run;
Seems it will not allow a lower case string once passed through the SHA function.
Is there a workaround for this? The below attempt did not yield desirable results.
data have2;
set have;
source = catt(first,last,dob,sex);
encryp = lowcase(sha256(source));
format encryp $hex64.;
run;

The issue is not with the sha256 function, but with the $HEX64 format.
When you used lowcase you actually do some harm to the SHA256 result: you're not altering the hexadecimal representation, but you're actually altering the characters themselves, which means your result isn't accurate - and then you're displaying them with $HEX64. which will always show capital letters for the hexadecimal characters.
What instead you want, presumably, is to store the lower case version of the $HEX64. format. You can do that with put:
data want;
set have;
source = catt(first,last,dob,sex);
encryp = sha256(source);
lower = lowcase(put(encryp,$HEX64.));
run;
Note what encryp looks like - something totally different, and probably not particularly useful. You can of course skip that step if you want.

Below will do it using a put statement with the format inside:
data have2;
set have;
encryp = lowcase(put(sha256(catt(first,last,dob,sex)),$hex64.));
run;
It will show an entirely different encryption code compared to your method but it remains consistent.
data have;
input first $ last $ dob $ 10. sex $;
cards;
test person 1955-07-31 1
test person 1955-07-31 1
test2 person 1977-08-11 2
test3 person 1945-12-22 1
;
run;
data have2;
set have;
new_encryp = lowcase(put(sha256(catt(first,last,dob,sex)),hex64.)); /* new method */
encryp = lowcase(sha256(catt(first,last,dob,sex))); /* what you tried */
format encryp $hex64.;
run;
/* output */
first last dob sex new_encryp encryp
test person 7/31/1955 1 038a855a47f40edf54094adc4366e3e79c1a931346d7968e96d2cb930b01e7bc 039A857A67F40EDF74096AFC6366E3E79C1A931366D7969E96F2EB930B01E7BC
test person 7/31/1955 1 038a855a47f40edf54094adc4366e3e79c1a931346d7968e96d2cb930b01e7bc 039A857A67F40EDF74096AFC6366E3E79C1A931366D7969E96F2EB930B01E7BC
test2 person 8/11/1977 2 1117ab614f48a7edfbe9d615f12acad9d564b457b0f31bb2619f7eb9b10f1e58 1117AB616F68A7EDFBE9F615F12AEAF9F564B477B0F31BB261FF7EB9B10F1E78
test3 person 12/22/1945 1 d1cb00ebe044c0553039f99592dc7bd4804eac2c13da8208fd82459c3a37efd1 F1EB00EBE064E0753039F99592FC7BF4806EAC2C13FA8208FD82659C3A37EFF1

*convert case;
%let txt="SEQ_CLAIM_ID, SEQ_MEMB_ID, EFFECTIVE_DATE";
%macro lower_case(txt);
data;
text=lowcase(&txt);
run;
%mend;
%lower_case(&txt);

Related

Combine one column's values into a single string

This might sound awkward but I do have a requirement to be able to concatenate all the values of a char column from a dataset, into one single string. For example:
data person;
input attribute_name $ dept $;
datalines;
John Sales
Mary Acctng
skrill Bish
;
run;
Result : test_conct = "JohnMarySkrill"
The column could vary in number of rows in the input dataset.
So, I tried the code below but it errors out when the length of the combined string (samplkey) exceeds 32K in length.
DATA RECKEYS(KEEP=test_conct);
length samplkey $32767;
do until(eod);
SET person END=EOD;
if lengthn(attribute_name) > 0 then do;
test_conct = catt(test_conct, strip(attribute_name));
end;
end;
output; stop;
run;
Can anyone suggest a better way to do this, may be break down a column into chunks of 32k length macro vars?
Regards
It would very much help if you indicated what you're trying to do but a quick method is to use SQL
proc sql NOPRINT;
select name into :name_list separated by ""
from sashelp.class;
quit;
%put &name_list.;
As you've indicated macro variables do have a size limit (64k characters) in most installations now. Depending on what you're doing, a better method may be to build a macro that puts the entire list as needed into where ever it needs to go dynamically but you would need to explain the usage for anyone to suggest that option. This answers your question as posted.
Try this, using the VARCHAR() option. If you're on an older version of SAS this may not work.
data _null_;
set sashelp.class(keep = name) end=eof;
length long_var varchar(1000000);
length want $256.;
retain long_var;
long_var = catt(long_var, name);
if eof then do;
want = md5(long_var);
put want;
end;
run;

SAS locate country from the address string?

I have a list of customer addresses in a dataset where I am trying to locate the Country of residence, for example: NEWSOUTHWALESAUSTRALIA could be indexed to report the country as Australia. I am trying to use the do loop approach to scan through the list of 252 Countries to relate the Country of residence from a dataset called address_format
The dataset test has the list of 252 Countries which have upcased & compressed, as has the field concat_address, so should no issues with differences in the text.
%macro counter;
%do ii = 1 %to 252;
data test;
set country_data (obs=&ii.);
call symput('New_upcase_country',trim(New_upcase_country));
country_new = compress(trim(country_two));
call symput('country_new',trim(country_new));
run;
data ADDRESS_FORMAT_NEW;
set ADDRESS_FORMAT;
length success $70.;
format success $70.;
if index(concat_address,"&country_new.") ge 1
then do ;
country="&country_new.";
end;
run;
%end;
%mend;
%counter;
For some odd reason If I manually programme if index(concat_address,'AUSTRALIA'), I get results, but inside the macro the results are blank.
Is there something obvious I am missing that is preventing the success of the country index?
The obs= option can be thought of as lastobs= (there is no lastobs option).
The option for skipping the first n-1 observations is firstobs=
This example will yield 4 rows (8-5+1)
data class;
set sashelp.class (firstobs=5 obs=8);
run;
So you want
firstobs=&ii obs=&ii, or
firstobs=&ii and a STOP;RUN; to prevent going to row &ii+1 and beyond.
Despite the above answer, I would recommend switching to a no macro approach that does all 252 checks in one data step (versus one step per check). There are numerous ways of doing such, here is one way that does not use arrays or hashes
For example:
data have;
input;
text = _infile_;
datalines;
BONGOBONGOAUSTRALIA
SOMEWHERE IN CHINA
CANADA USA
TIBET LANE, NORWAY
run;
data countries;
length name $50;
input;
name = _infile_;
datalines;
AUSTRALIA
GERMANY
UNITED STATES
TIBET
NORWAY
run;
Output only first match. An important code feature is using point= and nobs= options on the set statement inside the do loop.
data want;
set have;
do index = 1 to check_count until (found);
set countries point=index nobs=check_count;
found = index(text,trim(name));
if found then matched_country = name;
end;
run;
Output all matches
data want (keep=text matched_country);
set have;
do index = 1 to check_count;
set countries point=index nobs=check_count;
found = index(text,trim(name));
if found then do;
found_count = sum(found_count,1);
matched_country = name;
output;
end;
end;
if not found_count > 0 then do;
matched_country = '** NO MATCH **';
output;
end;
run;

SAS Perl Regular Expressions: How to write correct syntax?

I have some complicated string parsing which would be very difficult to accomplish using regular SAS functions because of the string value inconsistency; as a result
I think I will need to use Perl Regular Expressions. Below have 4 variables (price, date, size, bundle) which I have to create using parts of the text string. I'm have trouble getting the syntax correct - I am new to regular expressions.
Here is a sample data set.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;
/The first variable is price it is normally located near the end or middle of the string/
data want;
set have;
price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
format price dollar8.2;
run;
Using the data set above I need to have this result:
price
0
79.99
89.99
89.99
79.99
64.99
/Date is always a series of consecutive digits. Either 6, 7 or 8. Using | which means 'or' I thought I would be able to pull that way/
data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;
Using the data set above I need to have this result:
Date
1192014
112014
2102014
272014
12252014
462014
1192014
12162013
/* For size there is always an ‘x’ in the middle of the sub-string which is with followed by two or three digits on either side*/
data want;
set have;
size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;
Size
728x90
160x600
300x250
160x600
728x90
/*This is normally located towards the beginning of the string. It’s always a single digit number followed by an x It in never followed by additional digits but can also be just 0. */
data want;
set have;
Bundle=prxparse('/(\d+)'x'',text);
run;
Bundle
0
3x
3x
3X
3x
0
2x
3x
The final product I am looking for should look like this:
Text Date price Size Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg 1192014 0 0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf 112014 79.99 3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw 2102014 89.99 3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp 272014 89.99 728x90 3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov 12252014 160x600 3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg 462014 300x250 0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf 1192014 79.99 160x600 2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf 12162013 64.99 728x90 3
x
If you're extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.
Sample usage, with date:
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;
data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;
Just enclose in parens the section you want to extract (in this case, all of it).
Date was easy - as you say, 6-8 numbers. The others may be harder. The 3x etc. bit you can probably find, depending on how strict you need to be; the price I think you'll have a very hard time finding. You need to be able to better articulate the rules. "Towards the beginning" isn't a regex rule. "The second set of digits" is; "The second to last set", perhaps might work. I'll see if I can figure out a few.
In your example data, this works. I in particular don't like the price search; that one may well fail with a more complicated set of data. You can figure out adding the decimal for yourself.
data have;
infile cards truncover;
input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;
data want;
set have;
rx_date = prxparse('~_(\d{6,8})[_\.]~io');
rx_price = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
rx_size = prxparse('~_(\d+x\d+)[_\.]~io');
rx_adnum = prxparse('~\s:?(\d\d)\s~io');
rc_date = prxmatch(rx_date,text);
rc_price = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size = prxmatch(rx_size,text);
rc_adnum = prxmatch(rx_adnum,text);
if rc_date then datevar = prxposn(rx_date,1,text);
if rc_price then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size then size = prxposn(rx_size,1,text);
if rc_adnum then adnum = prxposn(rx_adnum,1,text);
run;

How to combine text and numbers in catx statement

The variable upc is already defined in my cool dataset. How do I convert it to a macro variable? I am trying to combine both text and numbers. For example blah should equal upc=123;
data cool;
set cool;
blah = catx("","upc=&upc","ccc")
run;
If upc is a numeric variable and you just want to include its value into some character string then you don't need to do anything special. Concatenation function will convert it into character before concatenating automatically:
data cool;
blah = catx("","upc=",upc,"ccc");
run;
The result:
upc----blah
123 upc= 123ccc
BTW, if you want to concatenate strings without blanks between them, you can use function CATS(), which strips all leading and trailing spaces from each argument.
The following test code works for my SAS 9.3 x64 PC.
Please note that:
1.symputx() provide the connection between dataset and macro variables.
2.cats() will be more appropriate than catx() if delimiting characters are not needed.
3.If you did not attempt to create a new data set, data _NULL_ is fine.
You can check the log to see that the correct values are being assigned.
Bill
data a;
input ID $ x y ##;
datalines;
A 1 10 A 2 20 A 3 30
;
run;
options SymbolGen MPrint MLogic MExecNote MCompileNote=all;
data _NULL_;
set a;
call symputx(cats("blah",_N_),cats(ID,x),"G");
run;
%put blah1=&blah1;
%put blah2=&blah2;
%put blah3=&blah3;

How to create a new variable in SAS by extracting part of the value of an existing numeric variable?

I have two datasets in SAS that I would like to merge, but they have no common variables. One dataset has a "subject_id" variable, while the other has a "mom_subject_id" variable. Both of these variables are 9-digit codes that have just 3 digits in the middle of the code with common meaning, and that's what I need to match the two datasets on when I merge them.
What I'd like to do is create a new common variable in each dataset that is just the 3 digits from within the subject ID. Those 3 digits will always be in the same location within the 9-digit subject ID, so I'm wondering if there's a way to extract those 3 digits from the variable to make a new variable.
Thanks!
SQL(using sample data from Data Step code):
proc sql;
create table want2 as
select a.subject_id, a.other, b.mom_subject_id, b.misc
from have1 a JOIN have2 b
on(substr(a.subject_id,4,3)=substr(b.mom_subject_id,4,3));
quit;
Data Step:
data have1;
length subject_id $9;
input subject_id $ other $;
datalines;
abc001def other1
abc002def other2
abc003def other3
abc004def other4
abc005def other5
;
data have2;
length mom_subject_id $9;
input mom_subject_id $ misc $;
datalines;
ghi001jkl misc1
ghi003jkl misc3
ghi005jkl misc5
;
data have1;
length id $3;
set have1;
id=substr(subject_id,4,3);
run;
data have2;
length id $3;
set have2;
id=substr(mom_subject_id,4,3);
run;
Proc sort data=have1;
by id;
run;
Proc sort data=have2;
by id;
run;
data work.want;
merge have1(in=a) have2(in=b);
by id;
run;
an alternative would be to use
proc sql
and then use a join and the substr() just as explained above, if you are comfortable with sql
Assuming that your "subject_id" variable is a number then the substr function wont work as sas will try convert the number to a string. But by default it pads some paces on the left of the number.
You can use the modulus function mod(input, base) which returns the remainder when input is divided by base.
/*First get rid of the last 3 digits*/
temp_var = floor( subject_id / 1000);
/* then get the next three digits that we want*/
id = mod(temp_var ,1000);
Or in one line:
id = mod(floor(subject_id / 1000), 1000);
Then you can continue with sorting the new data sets by id and then merging.