Overview:
What I have: 81A8221704654BAC9B24CDFE5A834541 (hex string)
What I want: 172343408754022659539630353407870715201 (decimal equivalent as string)
Detail:
I am exporting data from SAS to a 3rd party platform that performs better when the unique key is in numeric format. The key to the data I'm using is currently stored as a 32 char string containing a hex value, e.g. 81A8221704654BAC9B24CDFE5A834541.
The issue I'm facing is that the largest integer value available in SAS on Linux is 9007199254740992 in decimal (or 20000000000000 in hex). The 32 character hex value when converted to decimal will be much larger and any attempt to do conversion using numeric values will lose precision.
The only two approaches I can think of, are to either write some functions that do addition of two strings (ugly+slow), or perhaps someone has a clever way using some bitwise logical functions?
Any suggestions?
You can use Java BigInteger.
Groovy
Write data to file and read conversion back
data have;
input hexstr $32.;
datalines;
81A8221704654BAC9B24CDFE5A834540
81A8221704654BAC9B24CDFE5A834541
81A8221704654BAC9B24CDFE5A834542
81A8221704654BAC9B24CDFE5A834543
;
proc export data=have replace outfile='c:\temp\have-hexstr.dat' dbms=tab;
putnames=no;
run;
proc groovy;
submit;
File out = new File("c:\\temp\\want-decstr.dat");
out.write "hexstr,decstr\n";
new File ("c:\\temp\\have-hexstr.dat").eachLine { line ->
out.append(line+",\""+new BigInteger(line, 16).toString()+"\"\n");
}
endsubmit;
quit;
options source;
proc import datafile='c:\temp\want-decstr.dat' replace out=want dbms=csv;
run;
JavaObj
You could create a jar that contains a Converter class for performing the conversion and call the method from a JavaObj. The jar would have to be added to the SAS session classpath at startup.
public class Converter
{
public string hexStrToDecStr(hexStr)
{
return new java.math.BigInteger(hexStr,16).toString();
}
}
data want;
set have;
if _n_ = 1 then do;
declare javaobj j('Converter');
end;
length decStr $50;
j.callStringMethod('hexStrToDecStr', hexStr, decStr);
run;
Ideal JavaObj
Can't do this because JavaObj can only pass double values to methods :(
data want;
set have;
length decStr $50;
* wont work, cause 16 passed as double! :(;
declare javaobj j ('java/math/BigInteger', hexStr, 16);
j.callStringMethod('toString', decStr);
j.delete();
run;
Ok, here goes... I tested four approaches:
Ugly+Slow. My own SAS code that implements the ability to convert a 32 char string containing a hex value into a very big integer.
Process externally to SAS then join the data back. I chose Java for it's ubiquity as well as its existing support for BigIntegers, and the resulting simple code.
Run some Python code from inside a SAS proc fcmp function to perform the conversion. The FCMP function can then be used within a datastep.
(per Richard's answer) Use proc groovy to perform the conversion.
Results:
My SAS implementation took ~15 minutes to take the same hex string 1
million times and convert it to a big integer (stored as a string).
The java implementation took ~3 seconds to generate 1 million hex
strings and convert them to big integers. This doesn't include any additional time that would be required to export the data, and join it back.
The proc fcmp + python solution took ~90 seconds to process 1 million UUIDs or hex string equivalents and convert them all within a SAS datastep.
Richard's proc groovy solution took ~10 seconds to process 1 million observations including writing them to disk and reading them back in. Joining these back to a wider dataset would incurr additional time/IO cost.
With the huge disparity in timing, the vanilla SAS option is clearly not the solution use unless you don't have the ability to implement any of the other solutions. The one advantage it does have is that it's all done natively within SAS - so if your admins have your system on lockdown, you can still achieve the conversion, albeit slower.
Between the other 3 solutions, I think the best solution depends on your use case:
The proc fcmp solution has the advantage that it makes for cleaner / self-contained SAS code that's easy to maintain/read/use and runs from within a datastep (something none of the other solutions can claim), but at the cost of running a little slower. I think this would be my preferred choice if I wasn't dealing with big-data performance issues.
The proc groovy solution is also nice in that all of the code can live within SAS and no external programs/JAR files are required. The downside is that it does incur the additional complexity of having to dump the hex values to disk, convert them, read them back in, and finally join back to the original dataset. Even with the additional steps, depending on the size/width of the original dataset, this is likely to be faster than the above 2 options.
The Java solution to me doesn't offer any benefits over the proc groovy solution . It may run marginally faster (untested), but even if it did, it requires compilation of the JAR file, and existing external .java source file, etc.. so leaves a bit more of a mess lying around.
Disclaimer:
My expertise isn't low level computer science coding - I'm sure there's ways to improve my vanilla SAS code approach but I don't know how so if you can suggest any improvements feel free to chime in.
#1 - SAS Implementation
The approach I took in SAS was as follows:
Slice the 32 character have string into 4 8-character strings.
Convert each of these 8 character strings from hex to decimal. These are small enough that SAS is able to store them as numeric values without losing precision and we can make use of the existing hex informat.
Multiply each of these 4 numerics by the appropriate base-16 exponents to 'resize' them accordingly. The exponents were precalculated outside of SAS using a big number calculator as 16^16 and 16^24 are too big for SAS to handle. To multiply these numbers together I created an FCMP function called multiply_strings.
Add the 4 resulting strings from step 3 together using an FCMP function I created called add_strings.
Below is some test code that performs the above steps and shows the workings, the steps are commented:
data _null_;
length front back final $80;
* CALCULATED USING A BIG NUMBER CALCULATOR. DONT PERFORM THESE CALCULATIONS IN SAS;
exp1 = '1'; * 16 ** 0;
exp2 = '4294967296'; * 16 ** 8;
exp3 = '18446744073709551616'; * 16 ** 16;
exp4 = '79228162514264337593543950336'; * 16 ** 24;
have = '81A8221704654BAC9B24CDFE5A834541';
correct = '172343408754022659539630353407870715201'; * TESTED USING THE SEARCH TERM: 81A8221704654BAC9B24CDFE5A834541 hex as decimal ;
* STEP1;
hex1 = upcase(substr(have,25,8));
hex2 = upcase(substr(have,17,8));
hex3 = upcase(substr(have, 9,8));
hex4 = upcase(substr(have, 1,8));
*STEP2;
part1 = input(hex1,hex8.);
part2 = input(hex2,hex8.);
part3 = input(hex3,hex8.);
part4 = input(hex4,hex8.);
*STEP3;
product1 = part1;
product2 = multiply_strings(put(part2,32.0), exp2, 0);
product3 = multiply_strings(put(part3,32.0), exp3, 0);
product4 = multiply_strings(put(part4,32.0), exp4, 0);
* STEP4;
* SHOULD CHANGE ADD STRINGS TO ACCEPT 4 PARAMS BUT WILL SEE HOW PERFORMANCE COMPARES FIRST;
front = add_strings(product1, product2);
back = add_strings(product3, product4);
final = add_strings(front , back );
put #3 final=;
put correct=;
put '-';
put have=;
put hex1=;
put hex2=;
put hex3=;
put hex4=;
put '-';
put part1=;
put part2=;
put part3=;
put part4=;
put '-';
put product1=;
put product2=;
put product3=;
put product4=;
put '-';
put exp1=;
put exp2=;
put exp3=;
put exp4=;
%runquit;
And the results from the above step:
final=172343408754022659539630353407870715201
correct=172343408754022659539630353407870715201
-
have=81A8221704654BAC9B24CDFE5A834541
hex1=5A834541
hex2=9B24CDFE
hex3=04654BAC
hex4=81A82217
-
part1=1518552385
part2=2602880510
part3=73747372
part4=2175279639
-
product1=1518552385
product2=11179286665845800960
product3=1360398897392653722978353152
product4=172343408752662260631058413017528008704
-
exp1=1
exp2=4294967296
exp3=18446744073709551616
exp4=79228162514264337593543950336
The code I used to run it 1 million times was this (I smashed it all together on the off-chance it would run faster, it didn't):
data _null_;
length front back final $80;
* CALCULATED USING A BIG NUMBER CALCULATOR. DONT PERFORM THESE CALCULATIONS IN SAS;
part1_exp = '1'; * 16 ** 0;
part2_exp = '4294967296'; * 16 ** 8;
part3_exp = '18446744073709551616'; * 16 ** 16;
part4_exp = '79228162514264337593543950336'; * 16 ** 24;
have = '81A8221704654BAC9B24CDFE5A834541';
correct = '172343408754022659539630353407870715201'; * TESTED USING THE SEARCH TERM: 81A8221704654BAC9B24CDFE5A834541 hex as decimal ;
do blah=1 to 1000000;
front = add_strings(input(upcase(substr(have,25,8)),hex8.), multiply_strings(put(input(upcase(substr(have,17,8)),hex8.),32.0), part2_exp, 0));
back = add_strings(multiply_strings(put(input(upcase(substr(have, 9,8)),hex8.),32.0), part3_exp, 0), multiply_strings(put(input(upcase(substr(have, 1,8)),hex8.),32.0), part4_exp, 0));
final = add_strings(front , back );
end;
run;
FCMP function: add_strings()
This function simply performs addition as you would write it out by hand. Obviously we're trying to avoid converting any significant part of it into a SAS numeric. Perhaps chunking it and using numeric addition would be faster but I'd already spent enough time writing it this way and time constraints etc.....
proc fcmp outlib=work.funcs.funcs;
function add_strings(str1 $, str2 $ ) $;
length result rev1 rev2 $40 digit $1;
len = max(lengthn(cats(str1)), lengthn(cats(str2)));
rev1 = cats(reverse(str1));
rev2 = cats(reverse(str2));
max_result_length = len + 1;
carry = 0;
result = '';
do cnt=1 to max_result_length;
s = sum(input(substr(rev1,cnt,1),best.), input(substr(rev2,cnt,1),best.), carry);
digit = mod(s,10);
carry = floor(s/10);
result = cats(digit, result);
end;
/* REMOVE LEADING ZEROS */
result_len = lengthn(cats(result));
do first_non_zero=1 to result_len;
if substr(result, first_non_zero, 1) ne '0' then do;
leave;
end;
end;
result = cats(substr(result,min(first_non_zero,result_len),result_len));
return(result);
endsub;
run;
FCMP function: multiply_strings()
This function multiplies two large strings containing very big integers together... It should be accurate to any number where the result is no more than 80 characters but I haven't tested the boundary conditions. The third option is simply to print debugging information to the log, set to 1 to print it.
This was a little trickier than the add_strings() function as typically multiplication by hand (at least how I do it) requires you to add numbers together as part of the process. We want to avoid doing this as these numbers will be very large and we would have had to use the add_strings() function to do it (which I already knew was slow from testing).
Instead I found this approach to multiplication which would avoid having to add any numbers together that would be too large for SAS to store: https://www.youtube.com/watch?v=b_xUE4wkVKY . I don't know if this approach has a proper name or not, I couldn't find a lot of information on it, my google-fu was lacking.
proc fcmp outlib=work.funcs.funcs;
function multiply_strings(str_a $, str_b $, debug ) $;
length result $80 rev_a rev_b $40 digit $1;
array a [40] a1-a40; * STR_A AS AN ARRAY;
array b [40] b1-b40; * STR_B AS AN ARRAY;
array base [80] base1-base80; * PRODUCT OF STR_A AND STR_B ELEMENTS.;
array r [80] r1-r80; * RESULT;
**
** CANT DYNAMICALLY SIZE ARRAY P UNFORTUNATELY (AFAIK).
** LOTS OF WASTED SPACE AND PROBABLY SLOWER BUT EASIER TO CODE/DEBUG I SUPPOSE.
** 39 LEVELS. 2 ROWS PER LEVEL. 80 CELLS PER ROW.
*;
array p [39,2,80] p1-p6240;
a_len = lengthn(str_a);
b_len = lengthn(str_b);
rev_a = reverse(str_a);
rev_b = reverse(str_b);
max_len = max(a_len, b_len);
* INITIALIZE A+B ARRAYS;
do cnt=41-(max_len) to 40;
a[cnt] = input(substr(rev_a,41-cnt,1),best.);
b[cnt] = input(substr(rev_b,41-cnt,1),best.);
if debug then do;
put cnt= max_len= a[cnt] b[cnt];
end;
end;
* CALCULATE BASE;
do cnt=41-(max_len) to 40;
product = a[cnt] * b[cnt];
left = cnt*2-1;
right = cnt*2-0;
base[left] = floor(product/10);
base[right] = mod(product,10);
if debug then do;
put cnt= a[cnt] b[cnt] product= base[left] base[right] left= right=;
end;
end;
* CALCULATE PYRAMID. BOTTOM (LVL=40) OF PYRAMID IS POINTY. REMEMBER EACH LEVEL IS 2 BLOCKS HIGH (MIDDLE DIMENSION OF ARRAY).;
do lvl=1 to max_len-1;
start_cell = 81-(max_len*2)+lvl;
end_cell = 80-lvl;
cnt_start = 41-(max_len);
cnt_end = 40;
cnt = cnt_start;
iter = 0;
do while (cnt + lvl le cnt_end); * CALCULATE THE PRODUCTS;
product1 = a[cnt]*b[cnt+lvl];
product2 = b[cnt]*a[cnt+lvl];
cell1 = start_cell+(iter*2);
cell2 = cell1+1;
p[lvl,1,cell1] = floor(product1/10);
p[lvl,1,cell2] = mod(product1,10);
p[lvl,2,cell1] = floor(product2/10);;
p[lvl,2,cell2] = mod(product2,10);;
if debug then do;
put lvl= cnt_start= cnt_end= cnt= product1= product2= start_cell= end_cell= cell1= cell2= iter=;
end;
cnt = cnt + 1;
iter = iter + 1;
end;
end;
* CALCULATE RESULT. SIMPLY PERFORM ADDITION USING THE BASE AND THE PYRAMID ELEMENTS (BY COLUMN POSITION). CARRY FOR ANY NUMBERS >= 10.;
carry = 0;
do cell=80 to 81-(max_len*2) by -1;
r[cell] = sum(base[cell], carry);
do lvl=1 to max_len-1;
do row=1 to 2;
r[cell] = sum(0,r[cell],p[lvl,row,cell]);
end;
end;
carry = floor(r[cell]/10);
r[cell] = mod(r[cell],10);
end;
* OUTPUT THE FINAL CALCULATIONS TO THE LOG;
if debug eq 1 then do;
option linesize=200;
put a[*];
put b[*];
put ' ';
put base[*];
put ' ';
do lvl=1 to max_len-1;
do row=1 to 2;
do cell=1 to 80;
put p[lvl,row,cell] z1. #;
end;
put ;
end;
end;
put ' ';
put r[*];
end;
* TRIM MISSING VALUES AND ZEROES FROM THE FRONT OF THE RESULT.;
result = cats(of r[*]);
do first_non_zero=1 to 80;
if substr(result, first_non_zero, 1) not in (0,.) then do;
leave;
end;
end;
result = cats(substr(result,min(first_non_zero,80),80));
return(result);
endsub;
run;
#2 - The JAVA solution
Much more concise =P. Create a file called test.java containing the following:
import java.math.BigInteger;
import java.util.UUID;
public class test {
public static void main(String[] args)
{
int reps = 1000000;
for (int i = 0; i < reps; i++) {
String s = UUID.randomUUID().toString().replaceAll("\\-", "");
BigInteger big = new BigInteger(s, 16);
/*System.out.println(s);*/
/*System.out.println(big);*/
}
}
}
Compile it from the command line using javac test.java then execute it by java test.
#3 - proc fcmp+Python
This is the nicest approach IMO. The code is fast-enough for most use-cases, it's simple to call and can be run from inside a datastep, and works on both UUID formatted strings and 32 character hex strings.
If you're running a SAS version below SAS9.4M7 then you WILL need to define the MASM2PATH and MAS_PYPATH environment variables in the appropriate sasv9.cfg / sasv9_usermods.cfg file(s).
proc fcmp outlib=work.funcs.funcs;
/* SYSTEM REQUIREMENTS: SAS94M6. PYTHON2.7 OR ABOVE. MAS_M2PATH AND MAS_PYPATH MUST BE SET AS SYSTEM ENVIRONMENT VARIABLES */
function uuid_to_decimal(str $) $40;
* CREATE THE INNER PYTHON FUNCTION CALL;
declare object py(python);
submit into py;
def u2d( s ):
"Output: u2d_output"
import uuid
return str(uuid.UUID(s).int)
endsubmit;
rc = py.publish(); * PUBLISH THE CODE TO THE PYTHON INTERPRETER;
rc = py.call("u2d", str); * CALL THE PYTHON FUNCTION FROM SAS;
return(py.results["u2d_output"]);
endsub;
run;
option cmplib=work.funcs;
data _null_;
length test1 test2 $40;
do cnt=1 to 500000;
test1 = uuid_to_decimal("a341b0d0-1ec5-11eb-b64d-02b1159b38e9");
test2 = uuid_to_decimal("a341b0d01ec511ebb64d02b1159b38e9");
end;
put test1= test2=;
run;
#4 - proc groovy (thanks to #Richard for this solution)
A faster solution than the FCMP/Python solution but the big downside to me is that you need to dump the values to convert then import and join them back to the original dataset afterwards. The hidden performance cost will then depend on the size and width of the dataset you are working with.
%let work = %sysfunc(pathname(work));
data have;
input hexstr $32.;
do i = 1 to 1000000;
output;
end;
drop i;
datalines;
81A8221704654BAC9B24CDFE5A834540
;
run;
proc export data=have replace outfile="&work/have.txt" dbms=tab;
putnames=no;
run;
proc groovy;
submit "&work"; * PASS IN OUR WORK PATH AS THE FIRST PARAMETER;
File out = new File(args[0]+"/want.txt");
out.write "hexstr,decstr\n";
new File (args[0]+"/have.txt").eachLine { line ->
out.append(line+",\""+new BigInteger(line, 16).toString()+"\"\n");
}
endsubmit;
quit;
options source;
proc import datafile="&work/want.txt" replace out=want dbms=csv;
run;
I have a SAS dataset, with the following variables: ID, Var1_0, Var1_3, Var1_6, Var2_0, Var2_3, Var2_6, which can be read like this: Var1_0 is parameter 1 at time 0. For every subjects I have 2 variables and 3 time points. I want to transpose this into a long format using an array, I did this:
data long;
set wide;
array a Var1_0 Var1_3 Var1_6 Var2_0 Var_3 Var_6;
Do i=1 to dim(a);
outcome = a[i];
*Var = ???;
if (mod(i,3)=1) then Time = 0;
else if (mod(i,3)=2) then Time = 3;
else Time = 6;
Output;
end;
keep ID Outcome Time;
run;
The problem is that I don't know how to calculate the parameter variable, i.e., I want to add a variables that is either 1 or 2, depending on which parameter the value is related to. Is there a better way of doing this? Thank you !
Reeza gave you the answer in her comment. Here it is typed out.
data long;
set wide;
array a[*] Var1_0 Var1_3 Var1_6 Var2_0 Var2_3 Var2_6;
do i=1 to dim(a);
outcome = a[i];
var = vname(a[i]);
time = input(scan(var,2,'_'),best.);
/*Other stuff you want to do*/
output;
end;
run;
VNAME(array[sub]) gives you the variable name of the variable referenced by array[sub].
scan(str,i,delim) gives you the ith word in str using the specified delimiter.
This is a programming question, but I'll give you a little of the stats background first. This question refers to part of a data sim for a mixed-effects location scale model (i.e., heterogeneous variances). I'm trying to simulate two MVN variance components using the RANDNORMAL function in IML. Because both variance components are heterogeneous, the variances used by RANDNORMAL will differ across people. Thus, I need IML to select the specific row (e.g., row 1 = person 1) and use the RANDNORMAL function before moving onto the next row, and so on.
My example code below is for 2 people. I use DO to loop through each person's specific variance components (VC1 and VC2). I get the error: "Module RANDNORMAL called again before exit from prior call." I am assuming I need some kind of BREAK or EXIT function in the DO loop, but none I have tried work.
PROC IML;
ColNames = {"ID" "VC1" "VC2"};
A = {1 2 3,
2 8 9};
PRINT A[COLNAME=ColNames];
/*Set men of each variance component to 0*/
MeanVector = {0, 0};
/*Loop through each person's data using THEIR OWN variances*/
DO i = 1 TO 2;
VC1 = A[i,2];
VC2 = A[i,3];
CovMatrix = {VC1 0,
0 VC2};
CALL RANDSEED(1);
U = RANDNORMAL(2, MeanVector, CovMatrix);
END;
QUIT;
Any help is appreciated. Oh, and I'm using SAS 9.4.
You want to move some things around, but mostly you don't want to rewrite U twice: you need to write U's 1st row, then U's 2nd row, if I understand what you're trying to do. The below is a bit more efficient also, since I j() the U and _cv matrices rather than constructing then de novo every time through the loop (which is slow).
proc iml;
a = {1 2 3,2 8 9};
print(a);
_mv = {0,0};
U = J(2,2);
_cv = J(2,2,0);
CALL RANDSEED(1);
do i = 1 to 2;
_cv[1,1] = a[i,2];
_cv[2,2] = a[i,3];
U[i,] = randnormal(1,_mv, _cv);
end;
print(u);
quit;
Your mistake is the line
CovMatrix = {VC1 0, 0 VC2}; /* wrong */
which is not valid SAS/IML syntax. Instead, use #Joe's approach or use
CovMatrix = (VC1 || 0) // (0 || VC2);
For details, see the article "How to build matrices from expressions."
You might also be interested in this article that describes how to carry out this simulation with a block-diagonal matrix: "Constructing block matrices with applications to mixed models."
I've been trying to find the simplest way to generate simulated time series datasets in SAS. I initially was experimenting with the LAG operator, but this requires input data, so is proabably not the best way to go. (See this question: SAS: Using the lag function without a set statement (to simulate time series data.))
Has anyone developed a macro or dataset that enables time series to be genereated with an arbitrary number of AR and MA terms? What is the best way to do this?
To be specific, I'm looking to generate what SAS calls an ARMA(p,q) process, where p denotes the autoregressive component (lagged values of the dependent variable), and q is the moving average component (lagged values of the error term).
Thanks very much.
I have developed a macro to attempt to answer this question, but I'm not sure whether this is the most efficient way of doing this. Anyway, I thought it might be useful to someone:
%macro TimeSeriesSimulation(numDataPoints=100, model=y=e,outputDataSetName=ts, maxLags=10);
data &outputDataSetName (drop=j);
array lagy(&maxlags) _temporary_;
array lage(&maxlags) _temporary_;
/*Initialise values*/
e = 0;
y=0;
t=1;
do j = 1 to 10;
lagy(j) = 0;
lage(j) = 0;
end;
output;
do t = 2 to &numDataPoints; /*Change this for number of observations*/
/*SPECIFY MODEL HERE*/
e = rannorm(-1); /*Draw from a N(0,1)*/
&model;
/*Update values of lags on the moving average and autoregressive terms*/
do j = &maxlags-1 to 1 by -1; /*Note you have to do this backwards because otherwise you cascade the current value to all past values!*/
lagy(j+1) = lagy(j);
lage(j+1) = lage(j);
end;
lagy(1) = y;
lage(1) = e;
output;
end;
run;
%mend;
/*Example 1: Unit root*/
%TimeSeriesSimulation(numDataPoints=1000, model=y=lagy(1)+e)
/*Example 2: Simple process with AR and MA components*/
%TimeSeriesSimulation(numDataPoints=1000, model=y=0.5*lagy(1)+0.5*lage(1)+e)
Using SAS proc IML, I have a function: CVF(m,p,h,pi,e);
I would like to guess h which minimizes this function. Is there some built-in subroutine to minimize it? Or how can I construct an iterative process for it? Every other variables are defined.
Use call nlpqn();. The function you pass needs to have only 1 parameter, the vector you want to optimize over. So here I define a quadratic function where the a, b, and c parameters are also able to be defined. Use the GLOBAL statement and define the non-moving variables before the call.
Alternatively, you can put everything in the input vector, and then add constraints to keep those values from moving.
proc iml;
start myfun(x) global(a, b, c);
out = a*x**2 + b*x + c;
return (out);
finish myfun;
a = 1;
b = 2;
c = 4;
optn = {0, /* option 1: 0 -> MIN, 1 -> MAX*/
0 /* Print options 0-5 0 least, 5 most*/
};
init = 0;
call nlpqn(rc, res, "myfun", init);
/*rc > 0 means success*/
print rc res;
quit;
Returns:
The SAS System 08:59 Friday, June 27, 2014 3
rc res
3 -1
You can build your own optimization routine with IML language, of course, but you have also optimization routines built-in, see chapter 11 of IML User's Guide, Nonlinear Optimization Examples.