Overview:
What I have: 81A8221704654BAC9B24CDFE5A834541 (hex string)
What I want: 172343408754022659539630353407870715201 (decimal equivalent as string)
Detail:
I am exporting data from SAS to a 3rd party platform that performs better when the unique key is in numeric format. The key to the data I'm using is currently stored as a 32 char string containing a hex value, e.g. 81A8221704654BAC9B24CDFE5A834541.
The issue I'm facing is that the largest integer value available in SAS on Linux is 9007199254740992 in decimal (or 20000000000000 in hex). The 32 character hex value when converted to decimal will be much larger and any attempt to do conversion using numeric values will lose precision.
The only two approaches I can think of, are to either write some functions that do addition of two strings (ugly+slow), or perhaps someone has a clever way using some bitwise logical functions?
Any suggestions?
You can use Java BigInteger.
Groovy
Write data to file and read conversion back
data have;
input hexstr $32.;
datalines;
81A8221704654BAC9B24CDFE5A834540
81A8221704654BAC9B24CDFE5A834541
81A8221704654BAC9B24CDFE5A834542
81A8221704654BAC9B24CDFE5A834543
;
proc export data=have replace outfile='c:\temp\have-hexstr.dat' dbms=tab;
putnames=no;
run;
proc groovy;
submit;
File out = new File("c:\\temp\\want-decstr.dat");
out.write "hexstr,decstr\n";
new File ("c:\\temp\\have-hexstr.dat").eachLine { line ->
out.append(line+",\""+new BigInteger(line, 16).toString()+"\"\n");
}
endsubmit;
quit;
options source;
proc import datafile='c:\temp\want-decstr.dat' replace out=want dbms=csv;
run;
JavaObj
You could create a jar that contains a Converter class for performing the conversion and call the method from a JavaObj. The jar would have to be added to the SAS session classpath at startup.
public class Converter
{
public string hexStrToDecStr(hexStr)
{
return new java.math.BigInteger(hexStr,16).toString();
}
}
data want;
set have;
if _n_ = 1 then do;
declare javaobj j('Converter');
end;
length decStr $50;
j.callStringMethod('hexStrToDecStr', hexStr, decStr);
run;
Ideal JavaObj
Can't do this because JavaObj can only pass double values to methods :(
data want;
set have;
length decStr $50;
* wont work, cause 16 passed as double! :(;
declare javaobj j ('java/math/BigInteger', hexStr, 16);
j.callStringMethod('toString', decStr);
j.delete();
run;
Ok, here goes... I tested four approaches:
Ugly+Slow. My own SAS code that implements the ability to convert a 32 char string containing a hex value into a very big integer.
Process externally to SAS then join the data back. I chose Java for it's ubiquity as well as its existing support for BigIntegers, and the resulting simple code.
Run some Python code from inside a SAS proc fcmp function to perform the conversion. The FCMP function can then be used within a datastep.
(per Richard's answer) Use proc groovy to perform the conversion.
Results:
My SAS implementation took ~15 minutes to take the same hex string 1
million times and convert it to a big integer (stored as a string).
The java implementation took ~3 seconds to generate 1 million hex
strings and convert them to big integers. This doesn't include any additional time that would be required to export the data, and join it back.
The proc fcmp + python solution took ~90 seconds to process 1 million UUIDs or hex string equivalents and convert them all within a SAS datastep.
Richard's proc groovy solution took ~10 seconds to process 1 million observations including writing them to disk and reading them back in. Joining these back to a wider dataset would incurr additional time/IO cost.
With the huge disparity in timing, the vanilla SAS option is clearly not the solution use unless you don't have the ability to implement any of the other solutions. The one advantage it does have is that it's all done natively within SAS - so if your admins have your system on lockdown, you can still achieve the conversion, albeit slower.
Between the other 3 solutions, I think the best solution depends on your use case:
The proc fcmp solution has the advantage that it makes for cleaner / self-contained SAS code that's easy to maintain/read/use and runs from within a datastep (something none of the other solutions can claim), but at the cost of running a little slower. I think this would be my preferred choice if I wasn't dealing with big-data performance issues.
The proc groovy solution is also nice in that all of the code can live within SAS and no external programs/JAR files are required. The downside is that it does incur the additional complexity of having to dump the hex values to disk, convert them, read them back in, and finally join back to the original dataset. Even with the additional steps, depending on the size/width of the original dataset, this is likely to be faster than the above 2 options.
The Java solution to me doesn't offer any benefits over the proc groovy solution . It may run marginally faster (untested), but even if it did, it requires compilation of the JAR file, and existing external .java source file, etc.. so leaves a bit more of a mess lying around.
Disclaimer:
My expertise isn't low level computer science coding - I'm sure there's ways to improve my vanilla SAS code approach but I don't know how so if you can suggest any improvements feel free to chime in.
#1 - SAS Implementation
The approach I took in SAS was as follows:
Slice the 32 character have string into 4 8-character strings.
Convert each of these 8 character strings from hex to decimal. These are small enough that SAS is able to store them as numeric values without losing precision and we can make use of the existing hex informat.
Multiply each of these 4 numerics by the appropriate base-16 exponents to 'resize' them accordingly. The exponents were precalculated outside of SAS using a big number calculator as 16^16 and 16^24 are too big for SAS to handle. To multiply these numbers together I created an FCMP function called multiply_strings.
Add the 4 resulting strings from step 3 together using an FCMP function I created called add_strings.
Below is some test code that performs the above steps and shows the workings, the steps are commented:
data _null_;
length front back final $80;
* CALCULATED USING A BIG NUMBER CALCULATOR. DONT PERFORM THESE CALCULATIONS IN SAS;
exp1 = '1'; * 16 ** 0;
exp2 = '4294967296'; * 16 ** 8;
exp3 = '18446744073709551616'; * 16 ** 16;
exp4 = '79228162514264337593543950336'; * 16 ** 24;
have = '81A8221704654BAC9B24CDFE5A834541';
correct = '172343408754022659539630353407870715201'; * TESTED USING THE SEARCH TERM: 81A8221704654BAC9B24CDFE5A834541 hex as decimal ;
* STEP1;
hex1 = upcase(substr(have,25,8));
hex2 = upcase(substr(have,17,8));
hex3 = upcase(substr(have, 9,8));
hex4 = upcase(substr(have, 1,8));
*STEP2;
part1 = input(hex1,hex8.);
part2 = input(hex2,hex8.);
part3 = input(hex3,hex8.);
part4 = input(hex4,hex8.);
*STEP3;
product1 = part1;
product2 = multiply_strings(put(part2,32.0), exp2, 0);
product3 = multiply_strings(put(part3,32.0), exp3, 0);
product4 = multiply_strings(put(part4,32.0), exp4, 0);
* STEP4;
* SHOULD CHANGE ADD STRINGS TO ACCEPT 4 PARAMS BUT WILL SEE HOW PERFORMANCE COMPARES FIRST;
front = add_strings(product1, product2);
back = add_strings(product3, product4);
final = add_strings(front , back );
put #3 final=;
put correct=;
put '-';
put have=;
put hex1=;
put hex2=;
put hex3=;
put hex4=;
put '-';
put part1=;
put part2=;
put part3=;
put part4=;
put '-';
put product1=;
put product2=;
put product3=;
put product4=;
put '-';
put exp1=;
put exp2=;
put exp3=;
put exp4=;
%runquit;
And the results from the above step:
final=172343408754022659539630353407870715201
correct=172343408754022659539630353407870715201
-
have=81A8221704654BAC9B24CDFE5A834541
hex1=5A834541
hex2=9B24CDFE
hex3=04654BAC
hex4=81A82217
-
part1=1518552385
part2=2602880510
part3=73747372
part4=2175279639
-
product1=1518552385
product2=11179286665845800960
product3=1360398897392653722978353152
product4=172343408752662260631058413017528008704
-
exp1=1
exp2=4294967296
exp3=18446744073709551616
exp4=79228162514264337593543950336
The code I used to run it 1 million times was this (I smashed it all together on the off-chance it would run faster, it didn't):
data _null_;
length front back final $80;
* CALCULATED USING A BIG NUMBER CALCULATOR. DONT PERFORM THESE CALCULATIONS IN SAS;
part1_exp = '1'; * 16 ** 0;
part2_exp = '4294967296'; * 16 ** 8;
part3_exp = '18446744073709551616'; * 16 ** 16;
part4_exp = '79228162514264337593543950336'; * 16 ** 24;
have = '81A8221704654BAC9B24CDFE5A834541';
correct = '172343408754022659539630353407870715201'; * TESTED USING THE SEARCH TERM: 81A8221704654BAC9B24CDFE5A834541 hex as decimal ;
do blah=1 to 1000000;
front = add_strings(input(upcase(substr(have,25,8)),hex8.), multiply_strings(put(input(upcase(substr(have,17,8)),hex8.),32.0), part2_exp, 0));
back = add_strings(multiply_strings(put(input(upcase(substr(have, 9,8)),hex8.),32.0), part3_exp, 0), multiply_strings(put(input(upcase(substr(have, 1,8)),hex8.),32.0), part4_exp, 0));
final = add_strings(front , back );
end;
run;
FCMP function: add_strings()
This function simply performs addition as you would write it out by hand. Obviously we're trying to avoid converting any significant part of it into a SAS numeric. Perhaps chunking it and using numeric addition would be faster but I'd already spent enough time writing it this way and time constraints etc.....
proc fcmp outlib=work.funcs.funcs;
function add_strings(str1 $, str2 $ ) $;
length result rev1 rev2 $40 digit $1;
len = max(lengthn(cats(str1)), lengthn(cats(str2)));
rev1 = cats(reverse(str1));
rev2 = cats(reverse(str2));
max_result_length = len + 1;
carry = 0;
result = '';
do cnt=1 to max_result_length;
s = sum(input(substr(rev1,cnt,1),best.), input(substr(rev2,cnt,1),best.), carry);
digit = mod(s,10);
carry = floor(s/10);
result = cats(digit, result);
end;
/* REMOVE LEADING ZEROS */
result_len = lengthn(cats(result));
do first_non_zero=1 to result_len;
if substr(result, first_non_zero, 1) ne '0' then do;
leave;
end;
end;
result = cats(substr(result,min(first_non_zero,result_len),result_len));
return(result);
endsub;
run;
FCMP function: multiply_strings()
This function multiplies two large strings containing very big integers together... It should be accurate to any number where the result is no more than 80 characters but I haven't tested the boundary conditions. The third option is simply to print debugging information to the log, set to 1 to print it.
This was a little trickier than the add_strings() function as typically multiplication by hand (at least how I do it) requires you to add numbers together as part of the process. We want to avoid doing this as these numbers will be very large and we would have had to use the add_strings() function to do it (which I already knew was slow from testing).
Instead I found this approach to multiplication which would avoid having to add any numbers together that would be too large for SAS to store: https://www.youtube.com/watch?v=b_xUE4wkVKY . I don't know if this approach has a proper name or not, I couldn't find a lot of information on it, my google-fu was lacking.
proc fcmp outlib=work.funcs.funcs;
function multiply_strings(str_a $, str_b $, debug ) $;
length result $80 rev_a rev_b $40 digit $1;
array a [40] a1-a40; * STR_A AS AN ARRAY;
array b [40] b1-b40; * STR_B AS AN ARRAY;
array base [80] base1-base80; * PRODUCT OF STR_A AND STR_B ELEMENTS.;
array r [80] r1-r80; * RESULT;
**
** CANT DYNAMICALLY SIZE ARRAY P UNFORTUNATELY (AFAIK).
** LOTS OF WASTED SPACE AND PROBABLY SLOWER BUT EASIER TO CODE/DEBUG I SUPPOSE.
** 39 LEVELS. 2 ROWS PER LEVEL. 80 CELLS PER ROW.
*;
array p [39,2,80] p1-p6240;
a_len = lengthn(str_a);
b_len = lengthn(str_b);
rev_a = reverse(str_a);
rev_b = reverse(str_b);
max_len = max(a_len, b_len);
* INITIALIZE A+B ARRAYS;
do cnt=41-(max_len) to 40;
a[cnt] = input(substr(rev_a,41-cnt,1),best.);
b[cnt] = input(substr(rev_b,41-cnt,1),best.);
if debug then do;
put cnt= max_len= a[cnt] b[cnt];
end;
end;
* CALCULATE BASE;
do cnt=41-(max_len) to 40;
product = a[cnt] * b[cnt];
left = cnt*2-1;
right = cnt*2-0;
base[left] = floor(product/10);
base[right] = mod(product,10);
if debug then do;
put cnt= a[cnt] b[cnt] product= base[left] base[right] left= right=;
end;
end;
* CALCULATE PYRAMID. BOTTOM (LVL=40) OF PYRAMID IS POINTY. REMEMBER EACH LEVEL IS 2 BLOCKS HIGH (MIDDLE DIMENSION OF ARRAY).;
do lvl=1 to max_len-1;
start_cell = 81-(max_len*2)+lvl;
end_cell = 80-lvl;
cnt_start = 41-(max_len);
cnt_end = 40;
cnt = cnt_start;
iter = 0;
do while (cnt + lvl le cnt_end); * CALCULATE THE PRODUCTS;
product1 = a[cnt]*b[cnt+lvl];
product2 = b[cnt]*a[cnt+lvl];
cell1 = start_cell+(iter*2);
cell2 = cell1+1;
p[lvl,1,cell1] = floor(product1/10);
p[lvl,1,cell2] = mod(product1,10);
p[lvl,2,cell1] = floor(product2/10);;
p[lvl,2,cell2] = mod(product2,10);;
if debug then do;
put lvl= cnt_start= cnt_end= cnt= product1= product2= start_cell= end_cell= cell1= cell2= iter=;
end;
cnt = cnt + 1;
iter = iter + 1;
end;
end;
* CALCULATE RESULT. SIMPLY PERFORM ADDITION USING THE BASE AND THE PYRAMID ELEMENTS (BY COLUMN POSITION). CARRY FOR ANY NUMBERS >= 10.;
carry = 0;
do cell=80 to 81-(max_len*2) by -1;
r[cell] = sum(base[cell], carry);
do lvl=1 to max_len-1;
do row=1 to 2;
r[cell] = sum(0,r[cell],p[lvl,row,cell]);
end;
end;
carry = floor(r[cell]/10);
r[cell] = mod(r[cell],10);
end;
* OUTPUT THE FINAL CALCULATIONS TO THE LOG;
if debug eq 1 then do;
option linesize=200;
put a[*];
put b[*];
put ' ';
put base[*];
put ' ';
do lvl=1 to max_len-1;
do row=1 to 2;
do cell=1 to 80;
put p[lvl,row,cell] z1. #;
end;
put ;
end;
end;
put ' ';
put r[*];
end;
* TRIM MISSING VALUES AND ZEROES FROM THE FRONT OF THE RESULT.;
result = cats(of r[*]);
do first_non_zero=1 to 80;
if substr(result, first_non_zero, 1) not in (0,.) then do;
leave;
end;
end;
result = cats(substr(result,min(first_non_zero,80),80));
return(result);
endsub;
run;
#2 - The JAVA solution
Much more concise =P. Create a file called test.java containing the following:
import java.math.BigInteger;
import java.util.UUID;
public class test {
public static void main(String[] args)
{
int reps = 1000000;
for (int i = 0; i < reps; i++) {
String s = UUID.randomUUID().toString().replaceAll("\\-", "");
BigInteger big = new BigInteger(s, 16);
/*System.out.println(s);*/
/*System.out.println(big);*/
}
}
}
Compile it from the command line using javac test.java then execute it by java test.
#3 - proc fcmp+Python
This is the nicest approach IMO. The code is fast-enough for most use-cases, it's simple to call and can be run from inside a datastep, and works on both UUID formatted strings and 32 character hex strings.
If you're running a SAS version below SAS9.4M7 then you WILL need to define the MASM2PATH and MAS_PYPATH environment variables in the appropriate sasv9.cfg / sasv9_usermods.cfg file(s).
proc fcmp outlib=work.funcs.funcs;
/* SYSTEM REQUIREMENTS: SAS94M6. PYTHON2.7 OR ABOVE. MAS_M2PATH AND MAS_PYPATH MUST BE SET AS SYSTEM ENVIRONMENT VARIABLES */
function uuid_to_decimal(str $) $40;
* CREATE THE INNER PYTHON FUNCTION CALL;
declare object py(python);
submit into py;
def u2d( s ):
"Output: u2d_output"
import uuid
return str(uuid.UUID(s).int)
endsubmit;
rc = py.publish(); * PUBLISH THE CODE TO THE PYTHON INTERPRETER;
rc = py.call("u2d", str); * CALL THE PYTHON FUNCTION FROM SAS;
return(py.results["u2d_output"]);
endsub;
run;
option cmplib=work.funcs;
data _null_;
length test1 test2 $40;
do cnt=1 to 500000;
test1 = uuid_to_decimal("a341b0d0-1ec5-11eb-b64d-02b1159b38e9");
test2 = uuid_to_decimal("a341b0d01ec511ebb64d02b1159b38e9");
end;
put test1= test2=;
run;
#4 - proc groovy (thanks to #Richard for this solution)
A faster solution than the FCMP/Python solution but the big downside to me is that you need to dump the values to convert then import and join them back to the original dataset afterwards. The hidden performance cost will then depend on the size and width of the dataset you are working with.
%let work = %sysfunc(pathname(work));
data have;
input hexstr $32.;
do i = 1 to 1000000;
output;
end;
drop i;
datalines;
81A8221704654BAC9B24CDFE5A834540
;
run;
proc export data=have replace outfile="&work/have.txt" dbms=tab;
putnames=no;
run;
proc groovy;
submit "&work"; * PASS IN OUR WORK PATH AS THE FIRST PARAMETER;
File out = new File(args[0]+"/want.txt");
out.write "hexstr,decstr\n";
new File (args[0]+"/have.txt").eachLine { line ->
out.append(line+",\""+new BigInteger(line, 16).toString()+"\"\n");
}
endsubmit;
quit;
options source;
proc import datafile="&work/want.txt" replace out=want dbms=csv;
run;
I am working on a piece of code to take an array of dates and then out put an array of all dates that are within a given buffer of the dates within the original dates vector.
My plan is to use 2 nested do loops to loop through the original array of dates and each time add/subtract the buffer and then use data set to add these two observations to the original set.
I have been using the following code, but I end up in an infinite loop and SAS crashes.
%let buffer = 3;
data dates_with_buffer;
do i = -1*&buffer. to &buffer.;
do j = 1 to 14;
set original_dates point = j;
output_dates = dates + &buffer.;
output;
end;
end;
run;
When you use point= on a set statement, you need to include a stop statement as well to prevent an infinite loop. Try this:
%let buffer = 3;
data dates_with_buffer;
do i = -1*&buffer. to &buffer.;
do j = 1 to 14;
set original_dates point = j;
output_dates = dates + &buffer.;
output;
end;
end;
stop;
run;
I have a SAS dataset, with the following variables: ID, Var1_0, Var1_3, Var1_6, Var2_0, Var2_3, Var2_6, which can be read like this: Var1_0 is parameter 1 at time 0. For every subjects I have 2 variables and 3 time points. I want to transpose this into a long format using an array, I did this:
data long;
set wide;
array a Var1_0 Var1_3 Var1_6 Var2_0 Var_3 Var_6;
Do i=1 to dim(a);
outcome = a[i];
*Var = ???;
if (mod(i,3)=1) then Time = 0;
else if (mod(i,3)=2) then Time = 3;
else Time = 6;
Output;
end;
keep ID Outcome Time;
run;
The problem is that I don't know how to calculate the parameter variable, i.e., I want to add a variables that is either 1 or 2, depending on which parameter the value is related to. Is there a better way of doing this? Thank you !
Reeza gave you the answer in her comment. Here it is typed out.
data long;
set wide;
array a[*] Var1_0 Var1_3 Var1_6 Var2_0 Var2_3 Var2_6;
do i=1 to dim(a);
outcome = a[i];
var = vname(a[i]);
time = input(scan(var,2,'_'),best.);
/*Other stuff you want to do*/
output;
end;
run;
VNAME(array[sub]) gives you the variable name of the variable referenced by array[sub].
scan(str,i,delim) gives you the ith word in str using the specified delimiter.
I've been trying to find the simplest way to generate simulated time series datasets in SAS. I initially was experimenting with the LAG operator, but this requires input data, so is proabably not the best way to go. (See this question: SAS: Using the lag function without a set statement (to simulate time series data.))
Has anyone developed a macro or dataset that enables time series to be genereated with an arbitrary number of AR and MA terms? What is the best way to do this?
To be specific, I'm looking to generate what SAS calls an ARMA(p,q) process, where p denotes the autoregressive component (lagged values of the dependent variable), and q is the moving average component (lagged values of the error term).
Thanks very much.
I have developed a macro to attempt to answer this question, but I'm not sure whether this is the most efficient way of doing this. Anyway, I thought it might be useful to someone:
%macro TimeSeriesSimulation(numDataPoints=100, model=y=e,outputDataSetName=ts, maxLags=10);
data &outputDataSetName (drop=j);
array lagy(&maxlags) _temporary_;
array lage(&maxlags) _temporary_;
/*Initialise values*/
e = 0;
y=0;
t=1;
do j = 1 to 10;
lagy(j) = 0;
lage(j) = 0;
end;
output;
do t = 2 to &numDataPoints; /*Change this for number of observations*/
/*SPECIFY MODEL HERE*/
e = rannorm(-1); /*Draw from a N(0,1)*/
&model;
/*Update values of lags on the moving average and autoregressive terms*/
do j = &maxlags-1 to 1 by -1; /*Note you have to do this backwards because otherwise you cascade the current value to all past values!*/
lagy(j+1) = lagy(j);
lage(j+1) = lage(j);
end;
lagy(1) = y;
lage(1) = e;
output;
end;
run;
%mend;
/*Example 1: Unit root*/
%TimeSeriesSimulation(numDataPoints=1000, model=y=lagy(1)+e)
/*Example 2: Simple process with AR and MA components*/
%TimeSeriesSimulation(numDataPoints=1000, model=y=0.5*lagy(1)+0.5*lage(1)+e)
I've got something like the following:
proc means data = ... missing;
class 1 2 3 4 5;
var a b;
output sum=;
run;
This does what I want it to do, except for the fact that it is very difficult to differentiate between a missing value that represents a total, and a missing value that represents a missing value. For example, the following would appear in my output:
1 2 3 4 5 type sumA sumB
. . . . . 0 num num
. . . . . 1 num num
Ways I can think of handling this:
1) Change missings to a cardinal value prior to proc means. This is definitely doable...but not exactly clean.
2) Format the missings to something else prior, and then use preloadfmt? This is a bit of a pain...I'd really rather not.
3) Somehow use the proc means-generated variable type to determine whether the missing is a missing or a total
4) Other??
I feel like this is clearly a common enough problem that there must be a clean, easy way, and I just don't know what it is.
Option 3, for sure . Type is simply a binary number with 1 for each class variable, in order, that is included in the current row and 0 for each one that is missing. You can use the CHARTYPE option to ask for it to be given explicitly as a string ('01101110' etc.), or work with math if that's more your style.
How exactly you use this depends on what you're trying to accomplish. Rows that have a missing value on them will have a type that suggests a class variable should exist, but doesn't. So for example:
data want;
set have; *post-proc means assuming used CHARTYPE option;
array classvars a b c d e; *whatever they are;
hasmissing=0;
do _t = 1 to dim(classvars);
if char(_type_,_t) = 1 and classvars[_t] = . then hasmissing=1;
end;
*or;
if cmiss(of classvars[*]) = countc(_type_,'0') then hasmissing=0;
else hasmissing=1; *number of 0s = number of missings = total row, otherwise not;
run;
That's a brute force application, of course. You may also be able to identify it based on the number of missings, if you have a small number of types requested. For example, let's say you have 3 class variables (so 0 to 7 values for type), and you only asked for the 3 way combination (7, '111') and the 3 two way combination 'totals' (6,5,3, ie, '110','101','011'). Then:
data want;
set have;
if (_type_=7 and cmiss(of a b c) = 0) or (cmiss(of a b c) = 1) then ... ; *either base row or total row, no missings;
else ... ; *has at least one missing;
run;
Depending on your data, NMISS may also work. That checks to see if the number of missings is appropriate for the type of data.
Joe's strategy, modified slightly for my exact problem, because it may be useful to somebody at some point in the future.
data want;
set have;
array classvars a b c d e;
do _t = 1 to dim(classvars);
if char(_type_,_t) = 1 and (strip(classvars[_t] = "") or strip(classvars[_t]) = ".") then classvars[_t] = "TOTAL";
end;
run;
The rationale for the changes is as follows:
1) I'm working with (mostly) character variables, not numeric.
2) I'm not interested in whether a row has any missing or not, as those are very frequent, and I want to keep them. Instead, I just want the output to differentiate between the missings and the totals, which I have accomplished by renaming the instances of non-missing to something that indicates total.