multiples of 8 - optimal length for SAS character variables? - sas

I heard that SAS stores character variables in chunks of 8 bytes.
Therefore, the thinking goes, we should always set the length of character variables to a multiple of 8.
I have searched and could not find any support for the initial assertion.
Is it true? Is this covered somewhere in the documentation?

This answer applies to datasets that contain no 8-byte numeric variables; I will post separately for datasets that do.
No, there is nothing special about 8 byte character variable lengths.
See the below:
data length8;
length char0001-char9999 $8;
call missing(of _all_);
do _i = 1 to 100;
output;
end;
drop _i;
run;
data length7;
length char0001-char9999 $7;
call missing(of _all_);
do _i = 1 to 100;
output;
end;
drop _i;
run;
data length4;
length char0001-char9999 $4;
call missing(of _all_);
do _i = 1 to 100;
output;
end;
drop _i;
run;
data length12;
length char0001-char9999 $12;
call missing(of _all_);
do _i = 1 to 100;
output;
end;
drop _i;
run;
data length16;
length char0001-char9999 $16;
call missing(of _all_);
do _i = 1 to 100;
output;
end;
drop _i;
run;
data length17;
length char0001-char9999 $17;
call missing(of _all_);
do _i = 1 to 100;
output;
end;
drop _i;
run;
Each of these datasets is a different size, roughly proportional to the length of the character variables. Note that the length-4 dataset is a bit bigger proportionally (on my machine, anyway): in fact, lengths 4, 5, and 6 all produce the same size. This is because of the page size: the minimum page size on my installation is 64 KB (65,536 bytes), and with lengths 4, 5, and 6 only one row of data fits on a page (the rows are roughly 40, 50, and 60 KB). It's not because any particular size is reserved for a character variable, but because of the total length of the data record.
That's where you could potentially save space with a small change: if your data happen to be arranged such that the page size is just under double the row size, then making the row slightly smaller will halve the space used. That's unlikely to occur except in a very small number of cases, though: it requires a very wide row (many variables, or very long character variables). You can also alter the page size with options, which may be the better way to deal with edge cases like this.
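As a rough sketch of the page-size route, the BUFSIZE= data set option can request a larger page when the dataset is created. The dataset name length4_128k and the 128k value are only illustrative, and SAS may adjust the requested value to suit the host; PROC CONTENTS reports the page size actually used.

```sas
/* Same layout as the length4 example, but asking for a 128 KB page */
/* so that more than one ~40 KB row can share a page.               */
data length4_128k (bufsize=128k);
    length char0001-char9999 $4;
    call missing(of _all_);
    do _i = 1 to 100;
        output;
    end;
    drop _i;
run;

/* Check the page size and number of pages that were actually used */
proc contents data=length4_128k;
run;
```

Comparing the page count here with that of the original length4 dataset shows whether the larger page helped on a given installation.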

For datasets that contain a numeric variable, as #jaamor's example included, there is a difference related to the 8-byte size that does have some impact on storage. It will not usually have a significant impact on dataset size, but for datasets that are very tall and narrow it may be a consideration.
When a dataset contains numeric variables that are 8 bytes (the default) in length, SAS places those numeric variables at the end of the data vector and starts them at a multiple of 8 bytes, presumably to make access to those predictable numeric variables more efficient. Any variable other than an 8-byte numeric is placed at the start of the data vector, then any padding needed to bring that up to a multiple of 8 bytes is added, and then the 8-byte numeric variables are placed after that.
This can be seen by looking at the proc contents output from some example datasets.
data fourteen_eight;
length x y $7; *14 total;
length i 8;
run;
data twelve_eight;
length x y $6; *12 total;
length i 8;
run;
data twelve_six;
length x y $6; *12 total;
length i 6;
run;
data twelve_six_eight;
length x y $6;
length z 6;
length i 8;
run;
fourteen_eight has a conceptual observation length of 22, but a physical observation length of 24 (looking at PROC CONTENTS). twelve_eight has a conceptual length of 20, but a physical observation length of 24 as well. twelve_six has a conceptual length of 18 and a physical observation length of 18, meaning there is no padding if the numeric variable isn't 8 bytes long. twelve_six_eight has a conceptual length of 26 and a physical length of 32: 18 rounded up to 24, and then the 8 at the end. (You can verify it's not allocating 8 bytes for each numeric variable by simply adding several more 6-byte numbers; they never increase the total padding, and fit neatly in a smaller space.)
Here's how it ends up looking:
x $6
y $6
z 6
i 8
would fit like so:
[00000000011111111112222222222333333333344444444445]
[12345678901234567890123456789012345678901234567890]
[xxxxxxyyyyyyzzzzzz      iiiiiiii]
One side note: I'm not 100% sure that it's not [iiiiiiiixxxxxxyyyyyyzzzzzz      ]. That would work just as well as far as being able to predict the location of numeric variables. It doesn't really affect this, though: either way, if you have one or more 8-byte numeric variables, there will be a small amount of padding whenever the total storage for everything else is not a multiple of 8 bytes.
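One way to check these observation lengths without reading through PROC CONTENTS output is to query DICTIONARY.TABLES, which reports the observation length in its obslen column. This is only a sketch: it recreates the four example datasets in the WORK library and lists their lengths side by side.

```sas
/* Recreate the four example datasets */
data fourteen_eight;   length x y $7; length i 8;               run;
data twelve_eight;     length x y $6; length i 8;               run;
data twelve_six;       length x y $6; length i 6;               run;
data twelve_six_eight; length x y $6; length z 6; length i 8;   run;

/* List the number of variables and the observation length of each */
proc sql;
    select memname, nvar, obslen
        from dictionary.tables
        where libname = 'WORK'
          and memname in ('FOURTEEN_EIGHT', 'TWELVE_EIGHT',
                          'TWELVE_SIX', 'TWELVE_SIX_EIGHT');
quit;
```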

As Joe said, I did test empirically using the below script:
libname testlen "<directory>";
%macro create_ds(length=, dsName=);
data &dsName;
length x $&length.;
do i=1 to 1000000;
x="";
output;
end;
run;
%mend;
%macro create_all_ds;
%do i=1 %to 20;
%create_ds(length=&i, dsName=testlen.len&i)
%end;
%mend;
%create_all_ds
All datasets have one character variable, x (plus the 8-byte numeric loop counter i, which is not dropped). The length of x varies across datasets, from 1 to 20.
Datasets 1-8 take up ~15.8 MB
Datasets 9-16 take up ~23.7 MB
Datasets 17-20 take up ~31.5 MB
This probably means that it is not space efficient to declare SAS character variable lengths that are not multiples of 8 for datasets with one character variable.
I tried a similar test for 2 variable datasets:
%macro create_ds(length=, dsName=);
data &dsName;
length x y $&length.;
do i=1 to 1000000;
x="";
y="";
output;
end;
run;
%mend;
%macro create_all_ds;
%do i=1 %to 20;
%create_ds(length=&i, dsName=testlen.len&i)
%end;
%mend;
%create_all_ds
The results are as follows:
Datasets 1-4 take up ~15.8 MB
Datasets 5-8 take up ~23.7 MB
This could mean that for efficient length declarations the sum of the length of the character variables should be a multiple of eight.
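Note that the test macros above keep the loop counter i, an 8-byte numeric variable, in every dataset. A variant that drops it, sketched below, would help separate the effect of the character lengths from the alignment of 8-byte numerics described in the other answer. The macro names create_ds_nonum and create_all_nonum and the use of the WORK library are choices made for this sketch only, and no results are claimed for it here.

```sas
%macro create_ds_nonum(length=, dsName=);
    data &dsName (drop=i);      /* i is only a loop counter; do not store it */
        length x $&length.;
        do i = 1 to 1000000;
            x = "";
            output;
        end;
    run;
%mend;

%macro create_all_nonum;
    %local i;
    %do i = 1 %to 20;
        %create_ds_nonum(length=&i, dsName=work.nonum&i)
    %end;
%mend;

%create_all_nonum

/* Compare observation lengths across the 20 datasets */
proc sql;
    select memname, obslen
        from dictionary.tables
        where libname = 'WORK' and memname like 'NONUM%';
quit;
```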

Related

mean of 10 variables with different starting point (SAS)

I have 19 numeric variables, pm25_total2000 to pm25_total2018.
Each person has a starting year between 2013 and 2018; call that variable "reqyear".
I want to calculate, for each person, the mean over the 10 years up to and including the starting year.
For example, if a person has starting year 2015, I want mean(of pm25_total2006-pm25_total2015).
Or, if a person has starting year 2013, I want mean(of pm25_total2004-pm25_total2013).
How can I do this?
data _null_;
set scapkon;
reqyear=substr(iCDate,1,4)*1;
call symput('reqy',reqyear);
run;
data scatm;
set scapkon;
/* Mean of the 10 years before the recruitment year */
pm25means=mean(of pm25_total%eval(&reqy.-9)-pm25_total%eval(&reqy.));
run;
%eval(&reqy.-9) resolves to a single constant (the same value for every person), in my case 2007, so that doesn't work.
You can compute the mean with a traditional loop.
data want;
set have;
array x x2000-x2018;
call missing(sum, mean, n);
do _n_ = 1 to 10;
v = x ( start - 1999 -_n_ );
if not missing(v) then do;
sum + v;
n + 1;
end;
end;
if n then mean = sum / n;
run;
If you want to flex your SAS skill, you can use POKE and PEEK concepts to copy a fixed length slice (i.e. a fixed number of array elements) of an array to another array and compute the mean of the slice.
Example:
You will need to add sentinel elements and range checks on start to prevent errors when start-10 < 2000.
data have;
length id start x2000-x2018 8;
do id = 1 to 15;
start = 2013 + mod(id,6);
array x x2000-x2018;
do over x;
x = _n_;
_n_+1;
end;
output;
end;
format x: 5.;
run;
data want;
length id start mean10yrPriorStart 8;
set have;
array x x2000-x2018;
array slice(10) _temporary_;
call pokelong (
peekclong ( addrlong ( x(start-1999-10) ) , 10*8 ) ,
addrlong ( slice (1))
);
mean10yrPriorStart = mean(of slice(*));
run;
Use an array and a loop:
- index the array with years
- accumulate the sum of the values
- accumulate the count to account for any missing values
- divide to obtain the mean value
data want;
set have;
array _pm(2000:2018) pm25_total2000 - pm25_total2018;
do year=reqyear to (reqyear-9) by -1;
*add totals;
total = sum(total, _pm(year));
*add counts;
nyears = sum(nyears,not missing(_pm(year)));
end;
*accounts for possible missing years;
mean = total/nyears;
run;
Note this loop goes in reverse (start year to 9 years previous) because it's slightly easier to understand this way IMO.
If you have no missing values you can remove the nyears step, but not a bad thing to include anyways.
NOTE: My first answer did not address the OP's question, so this is a redux.
For this solution, I used Richard's code for generating test data. However, I added a line to randomly add missing values.
x = _n_;
if ranuni(1) < .1 then x = .;
_n_+1;
This alternative does not perform any explicit checks for missing values, because the sum() and n() functions inherently handle missing values appropriately. The loop over the dynamic slice of the data array only transfers the values to a temporary array. The final sum and count are computed on the temporary array outside of the loop.
data want;
set have;
array x(2000:2018) x:;
array t(10) _temporary_;
j = 1;
do i = start-9 to start;
t(j) = x(i);
j + 1;
end;
sum = sum(of t(*));
cnt = n(of t(*));
mean = sum / cnt;
drop x: i j;
run;
Result:
id start sum cnt mean
1 2014 72 7 10.285714286
2 2015 305 10 30.5
3 2016 458 9 50.888888889
4 2017 631 9 70.111111111

Setting all array values to a single value

I'd like to set all values in an array to 1 if some sort of condition is met, and perform a calculation if the condition isn't met. I'm using a do loop at the moment which is very slow.
I was wondering if there was a faster way.
data test2;
set test1;
array blah_{*} blah1-blah100;
array a_{*} a1-a100;
array b_{*} b1-b100;
do i=1 to 100;
blah_{i}=a_{i}/b_{i};
if b1=0 then blah_{i}=1;
end;
run;
I feel like the if statement is inefficient as I am setting the value 1 cell at a time. Is there a better way?
There are already several good answers, but for the sake of completeness, here is an extremely silly and dangerous way of changing all the array values at once without using a loop:
data test2;
set test1;
array blah_{*} blah1-blah100 (100*1);
array a_{*} a1-a100;
array b_{*} b1-b100;
/*Make a character copy of what an array of 100 1s looks like*/
length temp $800; *Allow 8 bytes per numeric variable;
retain temp;
if _n_ = 1 then temp = peekclong(addrlong(blah1), 800);
do i=1 to 100;
blah_{i}=a_{i}/b_{i};
end;
/*Overwrite the array using the stored value from earlier*/
if b1=0 then call pokelong(temp,addrlong(blah1),800);
run;
You have 100*NOBS assignments to do. Don't see how using a DO loop over an ARRAY is any more inefficient than any other way.
But there is no need to do the calculation when you know it will not be needed.
do i=1 to 100;
if b1=0 then blah_{i}=1;
else blah_{i}=a_{i}/b_{i};
end;
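Taking that one step further: since the condition depends only on b1, it can be tested once per observation rather than once per array element. A minimal sketch, with made-up test data standing in for test1:

```sas
/* Hypothetical test data: 3 rows, with b1 = 0 on the first row only */
data test1;
    array a_{*} a1-a100;
    array b_{*} b1-b100;
    do row = 1 to 3;
        do i = 1 to 100;
            a_{i} = i;
            b_{i} = row;
        end;
        if row = 1 then b1 = 0;
        output;
    end;
    drop row i;
run;

data test2;
    set test1;
    array blah_{*} blah1-blah100;
    array a_{*} a1-a100;
    array b_{*} b1-b100;
    /* Test the condition once, then run whichever loop applies */
    if b1 = 0 then do i = 1 to dim(blah_);
        blah_{i} = 1;
    end;
    else do i = 1 to dim(blah_);
        blah_{i} = a_{i} / b_{i};
    end;
    drop i;
run;
```

This still performs one assignment per element; it only avoids re-evaluating the same condition 100 times per row.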
This example uses a data set to "set" all values of an array without looping over the array. Note that because the BLAH variables are now read with SET, they are automatically retained instead of being reset to missing at the start of each iteration. I cannot comment on performance; you will need to do your own testing.
data one;
array blah[10];
retain blah 1;
run;
proc print;
run;
data test1;
do b1=0,1,0;
output;
end;
run;
data test2;
set test1;
array blah[10];
array a[10];
array b[10];
if b1 eq 0 then set one nobs=nobs point=nobs;
else do i = 1 to dim(blah);
blah[i] = i;
end;
run;
proc print;
run;
This is not a response to the original question, but a response to the discussion on the efficiency of using loops vs. SET to assign values to multiple variables.
Here is a simple experiment that I ran:
%let size = 100; /* Controls size of dataset */
%let iter = 1; /* Just to emulate different number of records in the base dataset */
data static;
array aa{&size} aa1 - aa&size (&size * 1);
run;
data inp;
do ii = 1 to &iter;
x = ranuni(234234);
output;
end;
run;
data eg1;
set inp;
array aa{&size} aa1 - aa&size;
set static nobs=nobs point=nobs;
run;
data eg2;
set inp;
array aa{&size} aa1 - aa&size;
do ii = 1 to &size;
aa(ii) = 1;
end;
run;
What I see when I run this with various values of &iter and &size is as follows:
As &size increases with &iter fixed at 1, the assignment method is faster than SET.
However, for a given &size, as &iter increases (i.e. the number of times the SET statement or the loop is executed), the SET approach becomes relatively faster while the assignment method becomes relatively slower, until at a certain point they cross. I think this is because the transfer from physical disk to buffer happens just once (since static is a relatively small dataset), whereas the assignment loop cost is fixed.
For this use case, where the fixed dataset used to set values is small, I admit that SET will be faster, especially when the logic needs to execute on multiple input records and relatively few variables need to be assigned. This will not be the case, however, if the dataset cannot be cached in memory between two records; the additional overhead of having to read it into the buffer can then slow it down.
I think this test isolates the statements of interest.
SUMMARY:
SET+create init array 0.40 sec. + 0.03 sec,
DO OVER array 11.64 sec.
NOTE: Additional host information:
X64_SRV12 WIN 6.2.9200 Server
NOTE: SAS initialization used:
real time 4.70 seconds
cpu time 0.07 seconds
1 options fullstimer=1;
2 %let d=1e4; /*array size*/
3 %let s=1e5; /*reps (obs)*/
4 data one;
5 array blah[%sysevalf(&d,integer)];
6 retain blah 1;
7 run;
NOTE: The data set WORK.ONE has 1 observations and 10000 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
user cpu time 0.03 seconds
system cpu time 0.00 seconds
memory 7788.90k
OS Memory 15232.00k
Timestamp 08/17/2019 06:57:48 AM
Step Count 1 Switch Count 0
8
9 sasfile one open;
NOTE: The file WORK.ONE.DATA has been opened by the SASFILE statement.
10 data _null_;
11 array blah[%sysevalf(&d,integer)];
12 do _n_ = 1 to &s;
13 set one nobs=nobs point=nobs;
14 end;
15 stop;
16 run;
NOTE: DATA statement used (Total process time):
real time 0.40 seconds
user cpu time 0.40 seconds
system cpu time 0.00 seconds
memory 7615.31k
OS Memory 16980.00k
Timestamp 08/17/2019 06:57:48 AM
Step Count 2 Switch Count 0
17 sasfile one close;
NOTE: The file WORK.ONE.DATA has been closed by the SASFILE statement.
18
19 data _null_;
20 array blah[%sysevalf(&d,integer)];
21 do _n_ = 1 to &s;
22 do i=1 to dim(blah); blah[i]=1; end;
23 end;
24 stop;
25 run;
NOTE: DATA statement used (Total process time):
real time 11.64 seconds
user cpu time 11.64 seconds
system cpu time 0.00 seconds
memory 3540.65k
OS Memory 11084.00k
Timestamp 08/17/2019 06:58:00 AM
Step Count 3 Switch Count 0
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 16.78 seconds
user cpu time 12.10 seconds
system cpu time 0.04 seconds
memory 15840.62k
OS Memory 16980.00k
Timestamp 08/17/2019 06:58:00 AM
Step Count 3 Switch Count 16
Some more interesting test results, based on data _null_'s original test. I also added the following test:
%macro loop;
data _null_;
array blah[%sysevalf(&d,integer)] blah1 - blah&d;
do _n_ = 1 to &s;
%do i = 1 %to &d;
blah&i = 1;
%end;
end;
stop;
run;
%mend;
%loop;
d     s    SET method (real/cpu)  %Loop (real/cpu)  Array based (real/cpu)
100   1e5  0.03/0.01              0.00/0.00         0.07/0.07
100   1e8  11.16/9.51             4.78/4.78         1:22.38/1:21.81
500   1e5  0.03/0.04              0.02/0.01         Did not measure
500   1e8  16.53/15.18            32.17/31.62       Did not measure
1000  1e5  0.03/0.03              0.04/0.03         0.74/0.70
1000  1e8  20.24/18.65            42.58/42.46       Did not measure
So with array-based assignments, it is not the assignment itself that is the big culprit. Since arrays use a memory map to map the original memory locations, it appears that the memory location lookup for a given subscript is what really impacts performance. A direct assignment avoids this and significantly improves performance.
So if your array size is in the low hundreds, direct assignment may not be a bad way to go. SET becomes effective when the array size goes beyond a few hundred.

Macro does not retain in a IF-THEN loop in DATA STEP

data output;
set input;
by id;
if first.id = 1 then do;
call symputx('i', 1); /* also tried %let i = 1; */
a_&i = a;
end;
else do;
call symputx('i', &i + 1); /* also tried %let i = %sysevalf (&i + 1); */
a_&i = a;
end;
run;
Example Data:
ID A
1 2
1 3
2 2
2 4
Want output:
ID A A_1 A_2
1 2 2 .
1 3 . 3
2 2 2 .
2 4 . 4
I know that you can do this using transpose, but I'm just curious why this way does not work. The macro variable does not retain its value for the next observation.
Thanks!
edit: Since %let is compile time, and call symput is execution time, %let will run only once and call symput will always be 1 step slow.
Regarding "why does this way not work":
The sequence of behavior in the SAS executor is:
1. resolve macro expressions
2. process steps
   a. automatic compile of proc or data step (compile-time)
   b. run the compilation (run-time)
A running data step cannot modify its PDV layout (part of the compilation process) while it is running.
call symput() is performed at run-time, so any changes it makes will not and cannot be applied to source code such as a_&i = a;
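The timing can be seen in a small experiment. This is only an illustrative sketch; the value 99 and the variable name now are arbitrary.

```sas
%let i = 1;

data _null_;
    call symputx('i', 99);                 /* runs at execution time */
    put "Resolved while compiling: &i";    /* the reference was already replaced by 1 */
    now = symget('i');                     /* looked up at execution time */
    put "Looked up while running: " now;
run;

%put After the step: &i;
```

The first PUT writes 1, because the macro reference was resolved before the step was compiled. The SYMGET lookup and the %PUT after the step both see 99, the value stored while the step was running.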
Array based transposition
You will need to determine the maximum number of items in the groups prior to coding the data step. Use array addressing to place the a value in the desired array slot:
* Hand coded transpose requires a scan over the data first;
* determine largest group size;
data _null_;
set have end=lastrecord_flag;
by id;
if first.id
then seq=1;
else seq+1;
retain maxseq 0;
if last.id then maxseq = max(seq,maxseq);
if lastrecord_flag then call symputx('maxseq', maxseq);
run;
* Use maxseq macro variable computed during scan to define array size;
data want (drop=seq);
set have;
by id;
array a_[&maxseq]; %* <--- set array size to max group size;
if first.id
then seq=1;
else seq+1;
a_[seq] = a; * 'triangular' transpose;
run;
Note: Your 'want' is a triangular reshaping of the data. To achieve a row per id reshaping the a_ elements would have to be cleared (call missing()) at first.id and output at last.id.
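A minimal sketch of that row-per-id variant follows. It uses the example data from the question and hard-codes maxseq for brevity; in practice the value would come from the scan step above. The output name want_wide is made up for this sketch.

```sas
/* Example data from the question */
data have;
    input id a;
    datalines;
1 2
1 3
2 2
2 4
;
run;

%let maxseq = 2;   /* assumed here; normally set by the scan step */

data want_wide (drop=seq a);
    set have;
    by id;
    array a_[&maxseq];
    retain a_1-a_&maxseq;            /* keep values across rows of a group */
    if first.id then do;
        seq = 0;
        call missing(of a_[*]);      /* clear leftovers from the previous id */
    end;
    seq + 1;
    a_[seq] = a;
    if last.id then output;          /* one row per id */
run;
```

For the example data this gives one row per id: id 1 with a_1=2 and a_2=3, and id 2 with a_1=2 and a_2=4.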

Function for conditioned mean in SAS

I have a dataset in which some cells are populated with 888888888 and 999999999. I would like to compute the mean without considering these values. That is:
x=5, y=10, z=888888888
the mean will be 5.
How can I do this?
As you're calculating across variables, just store them in an array, loop through them and sum any that are less than the required threshold (I've used 100,000,000), then divide by the total number of variables to get the mean.
data have;
input x y z;
datalines;
5 10 888888888
4 20 999999999
;
run;
data want;
set have;
array vars{*} x y z;
_sum=0;
do _i = 1 to dim(vars);
if vars{_i}<1e8 then _sum+vars{_i};
end;
mean_vars = _sum/dim(vars);
drop _: ;
run;
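The step above divides by the total number of variables, which reproduces the 5 given in the question. If the flagged values should instead be left out of the denominator as well, one possible variant is to copy the values into a temporary array, set the flagged ones to missing, and let MEAN() ignore them. The names clean, want_valid, and mean_valid are made up for this sketch.

```sas
data have;
    input x y z;
    datalines;
5 10 888888888
4 20 999999999
;
run;

data want_valid;
    set have;
    array vars{*} x y z;
    array clean{3} _temporary_;
    do _i = 1 to dim(vars);
        /* Treat the two special codes as missing */
        if vars{_i} in (888888888, 999999999) then clean{_i} = .;
        else clean{_i} = vars{_i};
    end;
    mean_valid = mean(of clean{*});   /* MEAN() skips missing values */
    drop _i;
run;
```

With the example data, mean_valid is 7.5 for the first row and 12 for the second.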

Determine rates of change for different groups

I have a SAS issue that I know is probably fairly straightforward for SAS users who are familiar with array programming, but I am new to this aspect.
My dataset looks like this:
Data have;
Input group $ size price;
Datalines;
A 24 5
A 28 10
A 30 14
A 32 16
B 26 10
B 28 12
B 32 13
C 10 100
C 11 130
C 12 140
;
Run;
What I want to do is determine the rate at which price changes between the first two items in each group and apply that rate to every other member of the group.
So, I’ll end up with something that looks like this (for A only…):
Data want;
Input group $ size price newprice;
Datalines;
A 24 5 5
A 28 10 10
A 30 14 12.5
A 32 16 15
;
Run;
The technique you'll need to learn is either retain or dif/lag. Both methods would work here.
The following illustrates one way to solve this, but would need additional work by you to deal with things like size not changing (meaning a 0 denominator) and other potential exceptions.
Basically, we use retain to cause a value to persist across records, and use that in the calculations.
data want;
set have;
by group;
retain lastprice rateprice lastsize;
if first.group then do;
counter=0;
call missing(of lastprice rateprice lastsize); *clear these out;
end;
counter+1; *Increment the counter;
if counter=2 then do;
rateprice=(price-lastprice)/(size-lastsize); *Calculate the rate over 2;
end;
if counter le 2 then newprice=price; *For the first two just move price into newprice;
else if counter>2 then newprice=lastprice+(size-lastsize)*rateprice; *Else set it to the change;
output;
lastprice=newprice; *save the price and size in the retained vars;
lastsize=size;
run;
Here is a different approach that is obviously longer than Joe's, but could be generalized to other similar situations where the calculation is different or depends on more values.
Add a sequence number to your data set:
data have2;
set have;
by group;
if first.group then seq = 0;
seq + 1;
run;
Use proc reg to calculate the intercept and slope for the first two rows of each group, outputting the estimates with outest:
proc reg data=have2 outest=est;
by group;
model price = size;
where seq le 2;
run;
quit;
Join the original table to the parameter estimates and calculate the predicted values:
proc sql;
create table want as
select
h.*,
e.intercept + h.size * e.size as newprice
from
have h
left join est e
on h.group = e.group
order by
group,
size
;
quit;