I have a dataset that consists of variables named month0-month120 and for each record I am trying to check if these variables equal a particular value. I am having a bit of trouble trying to do this dynamically rather than writing 120 lines of code. How would be the proper way to accomplish this? I am also having trouble formulating how to word the question which is also hindering me when searching online.
Edit: So basically I have this time series of values from the last 5 years represented in month0-120. I am trying to see how many '.' values are present within this array for each record. An example of input is as such
data testing;
set blah;
len = 0;
do i = 0 to 120;
if month[i] = . then len+1;
end;
run;
To count the number of missing values use NMISS().
data testing;
set blah;
len = nmiss(of month0-month120);
run;
Note CMISS() will also work since CMISS() works with both numeric and character variables.
For more general solution for referencing a set of variables use an ARRAY.
data testing;
set blah;
array months month0-month120;
do index=1 to dim(months);
* do something with MONTHS[index] ;
end;
run;
For the code you posted, you would need to explicitly declare your array and it's easier if you specify the index from 0 to 120. Otherwise, SAS would index it from 1 to 121 essentially.
data testing;
set blah;
array months(0:120) month0-month120;
len = 0;
do i = 0 to 120;
if month[i] = . then len+1;
end;
run;
Overview:
What I have: 81A8221704654BAC9B24CDFE5A834541 (hex string)
What I want: 172343408754022659539630353407870715201 (decimal equivalent as string)
Detail:
I am exporting data from SAS to a 3rd party platform that performs better when the unique key is in numeric format. The key to the data I'm using is currently stored as a 32 char string containing a hex value, e.g. 81A8221704654BAC9B24CDFE5A834541.
The issue I'm facing is that the largest integer value available in SAS on Linux is 9007199254740992 in decimal (or 20000000000000 in hex). The 32 character hex value when converted to decimal will be much larger and any attempt to do conversion using numeric values will lose precision.
The only two approaches I can think of, are to either write some functions that do addition of two strings (ugly+slow), or perhaps someone has a clever way using some bitwise logical functions?
Any suggestions?
You can use Java BigInteger.
Groovy
Write data to file and read conversion back
data have;
input hexstr $32.;
datalines;
81A8221704654BAC9B24CDFE5A834540
81A8221704654BAC9B24CDFE5A834541
81A8221704654BAC9B24CDFE5A834542
81A8221704654BAC9B24CDFE5A834543
;
proc export data=have replace outfile='c:\temp\have-hexstr.dat' dbms=tab;
putnames=no;
run;
proc groovy;
submit;
File out = new File("c:\\temp\\want-decstr.dat");
out.write "hexstr,decstr\n";
new File ("c:\\temp\\have-hexstr.dat").eachLine { line ->
out.append(line+",\""+new BigInteger(line, 16).toString()+"\"\n");
}
endsubmit;
quit;
options source;
proc import datafile='c:\temp\want-decstr.dat' replace out=want dbms=csv;
run;
JavaObj
You could create a jar that contains a Converter class for performing the conversion and call the method from a JavaObj. The jar would have to be added to the SAS session classpath at startup.
public class Converter
{
public string hexStrToDecStr(hexStr)
{
return new java.math.BigInteger(hexStr,16).toString();
}
}
data want;
set have;
if _n_ = 1 then do;
declare javaobj j('Converter');
end;
length decStr $50;
j.callStringMethod('hexStrToDecStr', hexStr, decStr);
run;
Ideal JavaObj
Can't do this because JavaObj can only pass double values to methods :(
data want;
set have;
length decStr $50;
* wont work, cause 16 passed as double! :(;
declare javaobj j ('java/math/BigInteger', hexStr, 16);
j.callStringMethod('toString', decStr);
j.delete();
run;
Ok, here goes... I tested four approaches:
Ugly+Slow. My own SAS code that implements the ability to convert a 32 char string containing a hex value into a very big integer.
Process externally to SAS then join the data back. I chose Java for it's ubiquity as well as its existing support for BigIntegers, and the resulting simple code.
Run some Python code from inside a SAS proc fcmp function to perform the conversion. The FCMP function can then be used within a datastep.
(per Richard's answer) Use proc groovy to perform the conversion.
Results:
My SAS implementation took ~15 minutes to take the same hex string 1
million times and convert it to a big integer (stored as a string).
The java implementation took ~3 seconds to generate 1 million hex
strings and convert them to big integers. This doesn't include any additional time that would be required to export the data, and join it back.
The proc fcmp + python solution took ~90 seconds to process 1 million UUIDs or hex string equivalents and convert them all within a SAS datastep.
Richard's proc groovy solution took ~10 seconds to process 1 million observations including writing them to disk and reading them back in. Joining these back to a wider dataset would incurr additional time/IO cost.
With the huge disparity in timing, the vanilla SAS option is clearly not the solution use unless you don't have the ability to implement any of the other solutions. The one advantage it does have is that it's all done natively within SAS - so if your admins have your system on lockdown, you can still achieve the conversion, albeit slower.
Between the other 3 solutions, I think the best solution depends on your use case:
The proc fcmp solution has the advantage that it makes for cleaner / self-contained SAS code that's easy to maintain/read/use and runs from within a datastep (something none of the other solutions can claim), but at the cost of running a little slower. I think this would be my preferred choice if I wasn't dealing with big-data performance issues.
The proc groovy solution is also nice in that all of the code can live within SAS and no external programs/JAR files are required. The downside is that it does incur the additional complexity of having to dump the hex values to disk, convert them, read them back in, and finally join back to the original dataset. Even with the additional steps, depending on the size/width of the original dataset, this is likely to be faster than the above 2 options.
The Java solution to me doesn't offer any benefits over the proc groovy solution . It may run marginally faster (untested), but even if it did, it requires compilation of the JAR file, and existing external .java source file, etc.. so leaves a bit more of a mess lying around.
Disclaimer:
My expertise isn't low level computer science coding - I'm sure there's ways to improve my vanilla SAS code approach but I don't know how so if you can suggest any improvements feel free to chime in.
#1 - SAS Implementation
The approach I took in SAS was as follows:
Slice the 32 character have string into 4 8-character strings.
Convert each of these 8 character strings from hex to decimal. These are small enough that SAS is able to store them as numeric values without losing precision and we can make use of the existing hex informat.
Multiply each of these 4 numerics by the appropriate base-16 exponents to 'resize' them accordingly. The exponents were precalculated outside of SAS using a big number calculator as 16^16 and 16^24 are too big for SAS to handle. To multiply these numbers together I created an FCMP function called multiply_strings.
Add the 4 resulting strings from step 3 together using an FCMP function I created called add_strings.
Below is some test code that performs the above steps and shows the workings, the steps are commented:
data _null_;
length front back final $80;
* CALCULATED USING A BIG NUMBER CALCULATOR. DONT PERFORM THESE CALCULATIONS IN SAS;
exp1 = '1'; * 16 ** 0;
exp2 = '4294967296'; * 16 ** 8;
exp3 = '18446744073709551616'; * 16 ** 16;
exp4 = '79228162514264337593543950336'; * 16 ** 24;
have = '81A8221704654BAC9B24CDFE5A834541';
correct = '172343408754022659539630353407870715201'; * TESTED USING THE SEARCH TERM: 81A8221704654BAC9B24CDFE5A834541 hex as decimal ;
* STEP1;
hex1 = upcase(substr(have,25,8));
hex2 = upcase(substr(have,17,8));
hex3 = upcase(substr(have, 9,8));
hex4 = upcase(substr(have, 1,8));
*STEP2;
part1 = input(hex1,hex8.);
part2 = input(hex2,hex8.);
part3 = input(hex3,hex8.);
part4 = input(hex4,hex8.);
*STEP3;
product1 = part1;
product2 = multiply_strings(put(part2,32.0), exp2, 0);
product3 = multiply_strings(put(part3,32.0), exp3, 0);
product4 = multiply_strings(put(part4,32.0), exp4, 0);
* STEP4;
* SHOULD CHANGE ADD STRINGS TO ACCEPT 4 PARAMS BUT WILL SEE HOW PERFORMANCE COMPARES FIRST;
front = add_strings(product1, product2);
back = add_strings(product3, product4);
final = add_strings(front , back );
put #3 final=;
put correct=;
put '-';
put have=;
put hex1=;
put hex2=;
put hex3=;
put hex4=;
put '-';
put part1=;
put part2=;
put part3=;
put part4=;
put '-';
put product1=;
put product2=;
put product3=;
put product4=;
put '-';
put exp1=;
put exp2=;
put exp3=;
put exp4=;
%runquit;
And the results from the above step:
final=172343408754022659539630353407870715201
correct=172343408754022659539630353407870715201
-
have=81A8221704654BAC9B24CDFE5A834541
hex1=5A834541
hex2=9B24CDFE
hex3=04654BAC
hex4=81A82217
-
part1=1518552385
part2=2602880510
part3=73747372
part4=2175279639
-
product1=1518552385
product2=11179286665845800960
product3=1360398897392653722978353152
product4=172343408752662260631058413017528008704
-
exp1=1
exp2=4294967296
exp3=18446744073709551616
exp4=79228162514264337593543950336
The code I used to run it 1 million times was this (I smashed it all together on the off-chance it would run faster, it didn't):
data _null_;
length front back final $80;
* CALCULATED USING A BIG NUMBER CALCULATOR. DONT PERFORM THESE CALCULATIONS IN SAS;
part1_exp = '1'; * 16 ** 0;
part2_exp = '4294967296'; * 16 ** 8;
part3_exp = '18446744073709551616'; * 16 ** 16;
part4_exp = '79228162514264337593543950336'; * 16 ** 24;
have = '81A8221704654BAC9B24CDFE5A834541';
correct = '172343408754022659539630353407870715201'; * TESTED USING THE SEARCH TERM: 81A8221704654BAC9B24CDFE5A834541 hex as decimal ;
do blah=1 to 1000000;
front = add_strings(input(upcase(substr(have,25,8)),hex8.), multiply_strings(put(input(upcase(substr(have,17,8)),hex8.),32.0), part2_exp, 0));
back = add_strings(multiply_strings(put(input(upcase(substr(have, 9,8)),hex8.),32.0), part3_exp, 0), multiply_strings(put(input(upcase(substr(have, 1,8)),hex8.),32.0), part4_exp, 0));
final = add_strings(front , back );
end;
run;
FCMP function: add_strings()
This function simply performs addition as you would write it out by hand. Obviously we're trying to avoid converting any significant part of it into a SAS numeric. Perhaps chunking it and using numeric addition would be faster but I'd already spent enough time writing it this way and time constraints etc.....
proc fcmp outlib=work.funcs.funcs;
function add_strings(str1 $, str2 $ ) $;
length result rev1 rev2 $40 digit $1;
len = max(lengthn(cats(str1)), lengthn(cats(str2)));
rev1 = cats(reverse(str1));
rev2 = cats(reverse(str2));
max_result_length = len + 1;
carry = 0;
result = '';
do cnt=1 to max_result_length;
s = sum(input(substr(rev1,cnt,1),best.), input(substr(rev2,cnt,1),best.), carry);
digit = mod(s,10);
carry = floor(s/10);
result = cats(digit, result);
end;
/* REMOVE LEADING ZEROS */
result_len = lengthn(cats(result));
do first_non_zero=1 to result_len;
if substr(result, first_non_zero, 1) ne '0' then do;
leave;
end;
end;
result = cats(substr(result,min(first_non_zero,result_len),result_len));
return(result);
endsub;
run;
FCMP function: multiply_strings()
This function multiplies two large strings containing very big integers together... It should be accurate to any number where the result is no more than 80 characters but I haven't tested the boundary conditions. The third option is simply to print debugging information to the log, set to 1 to print it.
This was a little trickier than the add_strings() function as typically multiplication by hand (at least how I do it) requires you to add numbers together as part of the process. We want to avoid doing this as these numbers will be very large and we would have had to use the add_strings() function to do it (which I already knew was slow from testing).
Instead I found this approach to multiplication which would avoid having to add any numbers together that would be too large for SAS to store: https://www.youtube.com/watch?v=b_xUE4wkVKY . I don't know if this approach has a proper name or not, I couldn't find a lot of information on it, my google-fu was lacking.
proc fcmp outlib=work.funcs.funcs;
function multiply_strings(str_a $, str_b $, debug ) $;
length result $80 rev_a rev_b $40 digit $1;
array a [40] a1-a40; * STR_A AS AN ARRAY;
array b [40] b1-b40; * STR_B AS AN ARRAY;
array base [80] base1-base80; * PRODUCT OF STR_A AND STR_B ELEMENTS.;
array r [80] r1-r80; * RESULT;
**
** CANT DYNAMICALLY SIZE ARRAY P UNFORTUNATELY (AFAIK).
** LOTS OF WASTED SPACE AND PROBABLY SLOWER BUT EASIER TO CODE/DEBUG I SUPPOSE.
** 39 LEVELS. 2 ROWS PER LEVEL. 80 CELLS PER ROW.
*;
array p [39,2,80] p1-p6240;
a_len = lengthn(str_a);
b_len = lengthn(str_b);
rev_a = reverse(str_a);
rev_b = reverse(str_b);
max_len = max(a_len, b_len);
* INITIALIZE A+B ARRAYS;
do cnt=41-(max_len) to 40;
a[cnt] = input(substr(rev_a,41-cnt,1),best.);
b[cnt] = input(substr(rev_b,41-cnt,1),best.);
if debug then do;
put cnt= max_len= a[cnt] b[cnt];
end;
end;
* CALCULATE BASE;
do cnt=41-(max_len) to 40;
product = a[cnt] * b[cnt];
left = cnt*2-1;
right = cnt*2-0;
base[left] = floor(product/10);
base[right] = mod(product,10);
if debug then do;
put cnt= a[cnt] b[cnt] product= base[left] base[right] left= right=;
end;
end;
* CALCULATE PYRAMID. BOTTOM (LVL=40) OF PYRAMID IS POINTY. REMEMBER EACH LEVEL IS 2 BLOCKS HIGH (MIDDLE DIMENSION OF ARRAY).;
do lvl=1 to max_len-1;
start_cell = 81-(max_len*2)+lvl;
end_cell = 80-lvl;
cnt_start = 41-(max_len);
cnt_end = 40;
cnt = cnt_start;
iter = 0;
do while (cnt + lvl le cnt_end); * CALCULATE THE PRODUCTS;
product1 = a[cnt]*b[cnt+lvl];
product2 = b[cnt]*a[cnt+lvl];
cell1 = start_cell+(iter*2);
cell2 = cell1+1;
p[lvl,1,cell1] = floor(product1/10);
p[lvl,1,cell2] = mod(product1,10);
p[lvl,2,cell1] = floor(product2/10);;
p[lvl,2,cell2] = mod(product2,10);;
if debug then do;
put lvl= cnt_start= cnt_end= cnt= product1= product2= start_cell= end_cell= cell1= cell2= iter=;
end;
cnt = cnt + 1;
iter = iter + 1;
end;
end;
* CALCULATE RESULT. SIMPLY PERFORM ADDITION USING THE BASE AND THE PYRAMID ELEMENTS (BY COLUMN POSITION). CARRY FOR ANY NUMBERS >= 10.;
carry = 0;
do cell=80 to 81-(max_len*2) by -1;
r[cell] = sum(base[cell], carry);
do lvl=1 to max_len-1;
do row=1 to 2;
r[cell] = sum(0,r[cell],p[lvl,row,cell]);
end;
end;
carry = floor(r[cell]/10);
r[cell] = mod(r[cell],10);
end;
* OUTPUT THE FINAL CALCULATIONS TO THE LOG;
if debug eq 1 then do;
option linesize=200;
put a[*];
put b[*];
put ' ';
put base[*];
put ' ';
do lvl=1 to max_len-1;
do row=1 to 2;
do cell=1 to 80;
put p[lvl,row,cell] z1. #;
end;
put ;
end;
end;
put ' ';
put r[*];
end;
* TRIM MISSING VALUES AND ZEROES FROM THE FRONT OF THE RESULT.;
result = cats(of r[*]);
do first_non_zero=1 to 80;
if substr(result, first_non_zero, 1) not in (0,.) then do;
leave;
end;
end;
result = cats(substr(result,min(first_non_zero,80),80));
return(result);
endsub;
run;
#2 - The JAVA solution
Much more concise =P. Create a file called test.java containing the following:
import java.math.BigInteger;
import java.util.UUID;
public class test {
public static void main(String[] args)
{
int reps = 1000000;
for (int i = 0; i < reps; i++) {
String s = UUID.randomUUID().toString().replaceAll("\\-", "");
BigInteger big = new BigInteger(s, 16);
/*System.out.println(s);*/
/*System.out.println(big);*/
}
}
}
Compile it from the command line using javac test.java then execute it by java test.
#3 - proc fcmp+Python
This is the nicest approach IMO. The code is fast-enough for most use-cases, it's simple to call and can be run from inside a datastep, and works on both UUID formatted strings and 32 character hex strings.
If you're running a SAS version below SAS9.4M7 then you WILL need to define the MASM2PATH and MAS_PYPATH environment variables in the appropriate sasv9.cfg / sasv9_usermods.cfg file(s).
proc fcmp outlib=work.funcs.funcs;
/* SYSTEM REQUIREMENTS: SAS94M6. PYTHON2.7 OR ABOVE. MAS_M2PATH AND MAS_PYPATH MUST BE SET AS SYSTEM ENVIRONMENT VARIABLES */
function uuid_to_decimal(str $) $40;
* CREATE THE INNER PYTHON FUNCTION CALL;
declare object py(python);
submit into py;
def u2d( s ):
"Output: u2d_output"
import uuid
return str(uuid.UUID(s).int)
endsubmit;
rc = py.publish(); * PUBLISH THE CODE TO THE PYTHON INTERPRETER;
rc = py.call("u2d", str); * CALL THE PYTHON FUNCTION FROM SAS;
return(py.results["u2d_output"]);
endsub;
run;
option cmplib=work.funcs;
data _null_;
length test1 test2 $40;
do cnt=1 to 500000;
test1 = uuid_to_decimal("a341b0d0-1ec5-11eb-b64d-02b1159b38e9");
test2 = uuid_to_decimal("a341b0d01ec511ebb64d02b1159b38e9");
end;
put test1= test2=;
run;
#4 - proc groovy (thanks to #Richard for this solution)
A faster solution than the FCMP/Python solution but the big downside to me is that you need to dump the values to convert then import and join them back to the original dataset afterwards. The hidden performance cost will then depend on the size and width of the dataset you are working with.
%let work = %sysfunc(pathname(work));
data have;
input hexstr $32.;
do i = 1 to 1000000;
output;
end;
drop i;
datalines;
81A8221704654BAC9B24CDFE5A834540
;
run;
proc export data=have replace outfile="&work/have.txt" dbms=tab;
putnames=no;
run;
proc groovy;
submit "&work"; * PASS IN OUR WORK PATH AS THE FIRST PARAMETER;
File out = new File(args[0]+"/want.txt");
out.write "hexstr,decstr\n";
new File (args[0]+"/have.txt").eachLine { line ->
out.append(line+",\""+new BigInteger(line, 16).toString()+"\"\n");
}
endsubmit;
quit;
options source;
proc import datafile="&work/want.txt" replace out=want dbms=csv;
run;
Could you please help me.
I would like to generate a random array from 0 to 5 and I'm using this function
rand_num = int(ranuni(0)*5+1)
But I would like to generate a random array with a nonrecurrent elements.
For example (1,2,3,4,5) (3,1,5,4,2) etc..
How I can do it?
Thank You!
/* draw with repetition */
data a;
array rand(5);
do i = 1 to dim(rand);
rand(i) = int(ranuni(0)*5+1);
end;
keep rand:;
run;
/* draw without repetition */
data a;
array rand(5);
do i = 1 to dim(rand);
do until(rand(i) ^= .);
value = int(ranuni(0)*5+1);
if value not in rand then
rand(i) = value;
end;
end;
keep rand:;
run;
I think call ranperm is the better solution for this, though both seem to have roughly the same statistical properties. Here's a solution using that (very similar to what Keith pointed to in #data_null_'s solution on another question):
data want;
array rand_array[5];
*initialize the array (once);
do _i = 1 to dim(rand_array);
rand_array[_i]=_i;
end;
*seed for the RNG;
seed=5;
*randomize;
*each time `call ranperm` is used, this shuffles the group;
do _i = 1 to 1e5;
call ranperm(seed,of rand_array[*]);
output;
end;
run;
I am working on a piece of code to take an array of dates and then out put an array of all dates that are within a given buffer of the dates within the original dates vector.
My plan is to use 2 nested do loops to loop through the original array of dates and each time add/subtract the buffer and then use data set to add these two observations to the original set.
I have been using the following code, but I end up in an infinite loop and SAS crashes.
%let buffer = 3;
data dates_with_buffer;
do i = -1*&buffer. to &buffer.;
do j = 1 to 14;
set original_dates point = j;
output_dates = dates + &buffer.;
output;
end;
end;
run;
When you use point= on a set statement, you need to include a stop statement as well to prevent an infinite loop. Try this:
%let buffer = 3;
data dates_with_buffer;
do i = -1*&buffer. to &buffer.;
do j = 1 to 14;
set original_dates point = j;
output_dates = dates + &buffer.;
output;
end;
end;
stop;
run;
I have a SAS dataset, with the following variables: ID, Var1_0, Var1_3, Var1_6, Var2_0, Var2_3, Var2_6, which can be read like this: Var1_0 is parameter 1 at time 0. For every subjects I have 2 variables and 3 time points. I want to transpose this into a long format using an array, I did this:
data long;
set wide;
array a Var1_0 Var1_3 Var1_6 Var2_0 Var_3 Var_6;
Do i=1 to dim(a);
outcome = a[i];
*Var = ???;
if (mod(i,3)=1) then Time = 0;
else if (mod(i,3)=2) then Time = 3;
else Time = 6;
Output;
end;
keep ID Outcome Time;
run;
The problem is that I don't know how to calculate the parameter variable, i.e., I want to add a variables that is either 1 or 2, depending on which parameter the value is related to. Is there a better way of doing this? Thank you !
Reeza gave you the answer in her comment. Here it is typed out.
data long;
set wide;
array a[*] Var1_0 Var1_3 Var1_6 Var2_0 Var2_3 Var2_6;
do i=1 to dim(a);
outcome = a[i];
var = vname(a[i]);
time = input(scan(var,2,'_'),best.);
/*Other stuff you want to do*/
output;
end;
run;
VNAME(array[sub]) gives you the variable name of the variable referenced by array[sub].
scan(str,i,delim) gives you the ith word in str using the specified delimiter.