I have data similar to this
(this won't work because of an "Array subscript out of range" error):
data test;
array id {5} (1, 8, 4, 12, 23);
array a_ {5};
do i = 1 to 5;
a_[id[i]] = id[i];
end;
run;
What I want to do is create variables whose names begin with 'a_' followed by the values of array id.
Meaning: a_1, a_8, a_4, a_12, a_23
This will only work if I declare array a_ with 23 members:
data test;
array id {5} (1, 8, 4, 12, 23);
array a_ {23};
do i = 1 to 5;
a_[id[i]] = id[i];
end;
run;
But then I get lots of missing variables I don't need.
I only want the above 5.
How can I achieve that?
PROC TRANSPOSE is usually the easiest way to do this.
First, make a vertical dataset like so:
data vert;
array id[5] (1,8,4,12,23);
do _i = 1 to dim(id);
varname = cats('A_',id[_i]);
vvalue = 1; *it is not apparent to me what the value should be in A_12 or whatnot;
output;
end;
run;
Then PROC TRANSPOSE makes your desired dataset.
proc transpose data=vert out=want;
id varname;
var vvalue;
run;
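If each variable should hold its own id value (as the question's a_[id[i]] = id[i] suggests), the same two steps could be sketched as follows. The dataset names vert_vals and want_vals are hypothetical, and the choice of value is an assumption about what the asker wants.

```sas
* build one row per desired variable, storing the id as the value (assumption);
data vert_vals;
  array id[5] _temporary_ (1, 8, 4, 12, 23);
  do _i = 1 to dim(id);
    varname = cats('A_', id[_i]);
    vvalue = id[_i];
    output;
  end;
  keep varname vvalue;
run;

* one output row with the variables A_1, A_8, A_4, A_12, A_23;
proc transpose data=vert_vals out=want_vals(drop=_name_);
  id varname;
  var vvalue;
run;
```

Only the five named variables are created, so there are no unused missing columns to drop afterwards.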
I have a 2 column dataset - accounts and attributes, where there are 6 types of attributes.
I am trying to use PROC TRANSPOSE to turn the 6 different attributes into 6 new columns, set to 1 where the account has that attribute and 0 where it doesn't.
This answer shows two approaches:
Proc TRANSPOSE, and
array based transposition using index lookup via hash.
If every account is missing the same attribute, the data alone cannot reveal the full set of attributes. Ideally the allowed or expected attributes should be listed in a separate table as part of your data reshaping.
Proc TRANSPOSE
When working with a table of only account and attribute you will need to construct a view adding a numeric variable that can be transposed. After TRANSPOSE the result data will have to be further massaged, replacing missing values (.) with 0.
Example:
data have;
call streaminit(123);
do account = 1 to 10;
do attribute = 'a','b','c','d','e','f';
if rand('uniform') < 0.75 then output;
end;
end;
run;
data stage / view=stage;
set have;
num = 1;
run;
proc transpose data=stage out=want;
by account;
id attribute;
var num;
run;
data want;
set want;
array attrs _numeric_;
do index = 1 to dim(attrs);
if missing(attrs(index)) then attrs(index) = 0;
end;
drop index;
run;
proc sql;
drop view stage;
quit;
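The earlier point about listing the expected attributes in a separate table could be sketched like this. The table attributes and the names stage_all and want_all are hypothetical. The idea is to prepend rows with a missing account that carry every attribute, so each column is created even if no real account has it.

```sas
* hypothetical table of all allowed attributes;
data attributes;
  do attribute = 'a','b','c','d','e','f';
    output;
  end;
run;

* rows coming from ATTRIBUTES have a missing account, which sorts first;
data stage_all / view=stage_all;
  set attributes have;
  num = 1;
run;

* transpose, then discard the dummy row for the missing account;
proc transpose data=stage_all out=want_all(drop=_name_ where=(account ne .));
  by account;
  id attribute;
  var num;
run;
```

The missing values in want_all would still need to be replaced with 0, as in the data step above.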
Advanced technique - Array and Hash mapping
In some cases Proc TRANSPOSE is deemed unusable by the coder or operator, perhaps because there are very many BY groups and very many attributes. An alternate way to transpose attribute values into like-named flag variables is to code:
Two scans
Scan 1 determine attribute values that will be encountered and used as column names
Store list of values in a macro variable
Scan 2
Arrayify the attribute values as variable names
Map values to array index using hash (or custom informat per @Joe)
Process each group. Set arrayed variable corresponding to each encountered attribute value to 1. Array index obtained via lookup through hash map.
Example:
* pass #1, determine attribute values present in data, the values will become column names;
proc sql noprint;
select distinct attribute into :attrs separated by ' ' from have;
* or make list of attributes from table of attributes (if such a table exists outside of 'have');
* select distinct attribute into :attrs separated by ' ' from attributes;
%put NOTE: &=attrs;
* pass #2, perform array based transposition;
data want2(drop=attribute);
* prep pdv, promulgate by group variable attributes;
if 0 then set have(keep=account);
array attrs &attrs.;
format &attrs. 4.;
if _n_=1 then do;
declare hash attrmap();
attrmap.defineKey('attribute');
attrmap.defineData('_n_');
attrmap.defineDone();
do _n_ = 1 to dim(attrs);
attrmap.add(key:vname(attrs(_n_)), data: _n_);
end;
end;
* preset all flags to zero;
do _n_ = 1 to dim(attrs);
attrs(_n_) = 0;
end;
* DOW loop over by group;
do until (last.account);
set have;
by account;
attrmap.find(); * lookup array index for attribute as column;
attrs(_n_) = 1; * set flag for attribute (as column);
end;
* implicit output one row per by group;
run;
One other option for doing this not using PROC TRANSPOSE is the data step array technique.
Here, I have a dataset that hopefully matches yours approximately. ID is probably your account, Product is your attribute.
data have;
call streaminit(2007);
do id = 1 to 4;
do prodnum = 1 to 6;
if rand('Uniform') > 0.5 then do;
product = byte(96+prodnum);
output;
end;
end;
end;
run;
Now, here we transpose it. We make an array with the six variables that could occur in HAVE. Then we iterate through the array to see if that variable is there. You can add a few additional lines to the if first.id block to set all of the variables to 0 instead of missing initially (I think missing is better, but YMMV).
data want;
set have;
by id;
array vars[6] a b c d e f;
retain a b c d e f;
if first.id then call missing(of vars[*]);
do _i = 1 to dim(vars);
if lowcase(vname(vars[_i])) = product then
vars[_i] = 1;
end;
if last.id then output;
run;
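The zero-initialisation mentioned above could be sketched as follows, assuming the same HAVE dataset. The output name want_zero is hypothetical.

```sas
data want_zero;
  set have;
  by id;
  array vars[6] a b c d e f;
  retain a b c d e f;
  * start each id with all flags at 0 instead of missing;
  if first.id then do _i = 1 to dim(vars);
    vars[_i] = 0;
  end;
  * flag the variable whose name matches this row's product;
  do _i = 1 to dim(vars);
    if lowcase(vname(vars[_i])) = product then vars[_i] = 1;
  end;
  if last.id then output;
  drop _i prodnum product;
run;
```

An id with no rows at all in HAVE still produces no output row, so this only changes how absent products are shown for ids that are present.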
We could do it a lot faster if we knew how the dataset was constructed, of course.
data want;
set have;
by id;
array vars[6] a b c d e f;
if first.id then call missing(of vars[*]);
retain a b c d e f;
vars[rank(product)-96]=1;
if last.id then output;
run;
Your data doesn't really work that way, but you could make an informat that maps each product to its array index.
*First we build an informat relating the product to its number in the array order;
proc format;
invalue arrayi
'a'=1
'b'=2
'c'=3
'd'=4
'e'=5
'f'=6
;
quit;
*Now we can use that!;
data want;
set have;
by id;
array vars[6] a b c d e f;
if first.id then call missing(of vars[*]);
retain a b c d e f;
vars[input(product,arrayi.)]=1;
if last.id then output;
run;
This last one is probably the absolute fastest option. It is most likely much faster than PROC TRANSPOSE, which tends to be one of the slower procs in my book, but at the cost of having to know ahead of time which variables you're going to have in that array.
Is it possible to get the frequency of values across an entire table in SAS? For example, I want to count how many yes's or no's are in an entire table. Thanks
A hash component object has keys and can track FIND references in a key summary variable, named with the keysum: argument tag when the object is declared. When each reference adds 1 (the value of the suminc: variable) to the key summary, the key summary is a frequency count.
data have;
* Words array from Abstract;
* "How Do I Love Hash Tables? Let Me Count The Ways!";
* by Judy Loren, Health Dialog Analytic Solutions;
* SGF 2008 - Beyond the Basics;
* https://support.sas.com/resources/papers/proceedings/pdfs/sgf2008/029-2008.pdf;
array words(17) $10 _temporary_ (
'I' 'love' 'hash' 'tables'
'You' 'will' 'too' 'after' 'you' 'see'
'what' 'they' 'can' 'do' '--' 'Judy' 'Loren'
);
call streaminit(123);
do row = 1 to 127;
attrib RESPONSE1-RESPONSE20 length = $10;
array RESPONSE RESPONSE1-RESPONSE20;
do over RESPONSE;
RESPONSE = words(rand('integer', 1, dim(words)));
end;
output;
end;
run;
data _null_;
if _n_ = 1 then do;
length term $10;
call missing (term);
retain one 1;
retain count 0;
declare hash bins(suminc:'one', keysum:'count');
bins.defineKey('term');
bins.defineData('term');
bins.defineDone();
end;
set have end=lastrow;
array response response1-response20;
do over response;
if bins.find(key:response) ne 0 then do;
bins.add(key:response, data:response, data:1);
end;
end;
if lastrow;
bins.output(dataset:'all_freq');
run;
Original answer, presumed only Yes and No
Yes. You can put the variables in an array, compute a 0/1 flag for each No/Yes value, and then use SUM to count the 1's. The count of 0's is the number of values minus that sum. SUM gives a frequency count here only because the flags are all 0's and 1's.
Example:
data have;
call streaminit(123);
do row = 1 to 100;
attrib ANSWER1-ANSWER20 length = $3;
array ANSWER ANSWER1-ANSWER20;
do over ANSWER; ANSWER = ifc(rand('uniform') > 0.15,'Yes','No'); end;
output;
end;
run;
data want(keep=freq_1 freq_0);
set have end=lastrow;
array ANSWER ANSWER1-ANSWER20;
array X(20) _temporary_;
do over ANSWER; x(_I_) = ANSWER = 'Yes'; end;
freq_1 + sum (of X(*));
freq_0 + dim(X) - sum (of X(*));
if lastrow;
run;
Transpose your main data and then do a proc freq. This is fully dynamic and scales as the number of questions or the number of distinct responses grows. You do need all variables to be of the same type though, either character or numeric.
*generate fake data;
data have;
call streaminit(99);
array q(30) q1-q30;
do i=1 to 100;
do j=1 to dim(q);
q(j) = rand('bernoulli', 0.8);
end;
output;
end;
run;
*flip it to a long format;
proc transpose data=have out=long;
by I;
var q1-q30;
run;
*get the summaries needed;
proc freq data=long;
table col1;
run;
You should get output as follows:
The FREQ Procedure

COL1   Frequency   Percent   Cumulative Frequency   Cumulative Percent
0      581         19.37     581                    19.37
1      2419        80.63     3000                   100.00
libname Prob 'Y:\alsdkjf\alksjdfl';
* the geoid here is char, I want to convert it to num to be able to merge by id;
data Problem2_1;
set Prob.geocode;
id = substr(GEOID, 8, 2);
id = input(id, best5.);
output;
run;
* geoid here is numeric;
data Problem2_2;
set Prob.households;
id = GEOID;
output;
run;
data Problem2_3;
merge Problem2_1
Problem2_2 ;
by ID;
run;
proc print data = Problem2_3;
*ERROR: Variable geoid has been defined as both character and numeric.
*ERROR: Variable id has been defined as both character and numeric.
It looks like you could replace these two lines:
id = substr(GEOID, 8, 2);
id = input(id, best5.);
With:
id = input(substr(GEOID, 8, 2), best.);
This would mean that both merge datasets contain numeric id variables.
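A quick way to confirm the conversion is to check the variable type with VTYPE. This is a minimal sketch using a made-up GEOID value, not the asker's data.

```sas
data _null_;
  GEOID = '1234567890';                      /* hypothetical sample value */
  id = input(substr(GEOID, 8, 2), best.);    /* characters 8-9 -> numeric */
  id_type = vtype(id);                       /* 'N' numeric, 'C' character */
  put id= id_type=;                          /* log shows id=89 id_type=N */
run;
```

Because id is only ever assigned the result of INPUT, SAS creates it as numeric from the start, which is what the original two-line version failed to do.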
SAS requires the linking id to be the same data type in both datasets. This means that you have to convert the int to a string or vice versa. Personally, I prefer to convert to numeric whenever possible.
Here is a worked example:
/*Create some dummy data for testing purposes:*/
data int_id;
length id 3 dummy $3;
input id dummy;
cards;
1 a
2 b
3 c
4 d
;
run;
data str_id;
length id $1 dummy2 $3;
input id dummy2;
cards;
1 aa
2 bb
3 cc
4 dd
;
run;
/*Convert string to numeric. Int in this case.*/
data str_id_to_int;
set str_id;
id2 =id+0; /* or you could use something like input(id, 8.)*/
/*The variable must be new. id=id+0 does _not_ work.*/
drop id; /*move id2->id */
rename id2=id;
run;
/*Same, but other way around. Imho, trickier.*/
data int_id_to_str;
set int_id;
id2=put(id, 1.); /*note that '1.' refers to a length of 1 */
/*There are other ways to convert int to string as well.*/
drop id;
rename id2=id;
run;
/*Testing. Results should be equivalent */
data merged_by_str;
merge str_id(in=a) int_id_to_str(in=b);
by id;
if a and b;
run;
data merged_by_int;
merge int_id(in=a) str_id_to_int(in=b);
by id;
if a and b;
run;
For Problem2_1, if your substring contains only numbers you can coerce it to numeric by adding zero. Something like this should make ID numeric and then you could merge with Problem2_2.
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = temp + 0;
drop temp;
run;
EDIT:
Your original code first defines ID as the output of substr, which is character, so the later assignment cannot make it numeric. This should work as well:
data Problem2_1;
set Prob.geocode;
temp = substr(GEOID, 8, 2);
id = input(temp, 8.0);
drop temp;
run;
I want to find a way to build another variable (it's OK even in the same dataset) that is a categorization of the old variable. I would choose the number of buckets (for example, using percentiles as cutoffs: p10, p20, p30, etc.).
Right now I do this by extracting the percentiles of the variable with proc univariate. But this gives me only the percentiles (my cutoffs), and then I have to build the new variable manually using the percentiles.
How can I create this new variable giving the cutoffs and the number of buckets as input?
Thanks in advance
Assuming you want equal percentage sized buckets, then PROC RANK might just get you what you are looking for.
data test;
do i=1 to 100;
output;
end;
run;
proc rank data=test out=test2 groups=5;
var i;
ranks grp;
run;
That will give you 5 groups (named 0 .. 4), which should be equivalent to P20, P40, ..., P80 cutoffs.
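To see where the group boundaries actually fall, the ranked output can be summarised by group. This is a small sketch that assumes TEST2 was created by the PROC RANK step above.

```sas
* count, minimum and maximum of i within each rank group;
proc means data=test2 n min max;
  class grp;
  var i;
run;
```

With the 100-row test data, each of the five groups should contain 20 consecutive values of i, so the maximum of each group shows the effective cutoff.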
If you wanted non-equal buckets, e.g. P10, P40, P60, and P90, then you would have to rank at the finest level needed and combine groups. Using the same test data as above:
%let groups=10;
proc rank data=test out=test2 groups=&groups;
var i;
ranks grp;
run;
/*
P = (grp+1)*(100/&groups)
Cutoffs 10, 40, 60, 90
implicit 5 new groups
*/
%let n_cutoff=4;
%let cutoffs=10, 40, 60, 90;
data test3(drop=_i cutoffs:);
set test2;
array cutoffs[&n_cutoff] (&cutoffs);
P = (grp+1)*(100/&groups); /* upper percentile of the rank group */
do _i=1 to &n_cutoff;
if P <= cutoffs[_i] then do;
new_grp = _i-1;
leave;
end;
if _i = &n_cutoff then
new_grp = _i;
end;
run;
10 is the greatest common divisor of the P values, and 100/10 = 10, so we need 10 groups from PROC RANK.
The Data Step at the end combines the groups using the cutoffs you are looking for.
I am trying to perform frequency analyses on test scores.
My data set has students with recorded scores.
Like below:
student, score
1, 1
2, 1
3, 1
4, 3
5, 3
6, 3
7, 4
8, 4
9, 4
10, 4
I run the code:
proc freq data=stuff;
tables score;
run;
The output:
score, freq, pct, cum.freq., cum.pct.
1, 3, .3, 3, .3
3, 3, .3, 6, .6
4, 4, .4, 10, 1
I would like to show:
score, freq, pct, cum.freq., cum.pct.
1, 3, .3, 3, .3
2, 0, 0, 3, .3
3, 3, .3, 6, .6
4, 4, .4, 10, 1
5, 0, 0, 10, 1
Is there a way that SAS can do this?
Thanks for the help.
There is always a way. Here are a few approaches. They either involve building the table by hand or preprocessing the input before asking SAS to summarise.
These will be used as inputs in the approaches below
/* Your input data */
data have;
infile datalines dlm = ",";
input student score;
datalines;
1,1
2,1
3,1
4,3
5,3
6,3
7,4
8,4
9,4
10,4
;
run;
%let max = 5;
/* Create a dummy set of all the desired groups */
data numbers;
do score = 1 to &max;
output;
end;
run;
This approach adds dummy values to the original data and uses PROC FREQ's WEIGHT statement to decide which rows contribute to the count (I prefer this method)
/* Combine the inputs and add weight variable */
data input;
set
have (in = a)
numbers;
/* Set weight so that empty groups won't be counted */
weight = ifn(a, 1, 0);
run;
/* Create the report */
proc freq data = input;
table score;
/* Only add to the count when weight = 1, include 0 rows */
weight weight / zeros;
run;
This approach builds the table by hand
/* Get the total count and each group's count */
proc summary data = have;
class score;
output out = counts;
run;
/* Combine and create the summary stats */
data want;
merge
counts (rename = (_FREQ_ = freq))
numbers;
by score;
/* Keep the total and cumulative total when new rows are loaded */
retain total cumsum;
/* Set up the total and delete the total row */
if _TYPE_ = 0 then do;
total = freq;
delete;
end;
/* Prevent missing values by adding 0 */
freq + 0;
/* Calculate stats */
pct = freq / total;
cumsum + freq;
cumpct = cumsum / total;
drop _TYPE_ total;
run;
proc print;run;
This approach modifies the proc freq output to add in the extra rows
/* Create the frequency report as a dataset */
proc freq data = have;
table score;
/* Output as dataset */
ods output OneWayFreqs = freqs (drop = table f_score);
run;
/* Combine and build the extra rows */
data want (drop = _:);
merge
freqs (in = f)
numbers;
by score;
/* Set up temporary variables for storing cumulatives */
retain _lp _lf;
/* Store cumulatives */
if f then do;
_lf = CumFrequency;
_lp = CumPercent;
end;
/* Put stored values into new rows */
else do;
Frequency = 0;
Percent = 0;
CumFrequency = _lf;
CumPercent = _lp;
end;
run;
proc print; run;
Using Joe's suggestion above will also get you part of the way, but unfortunately cannot compute the cumulative columns.
This is a partial solution using Joe's method
/* Create a format showing each of the desired levels */
proc format;
value score
1='1'
2='2'
3='3'
4='4'
5='5';
quit;
proc tabulate data = have;
/* Specify that every formatted value should be included */
class score / preloadfmt;
/* Apply the format */
format score score.;
tables score,
/* Request the output table include frequency and percent */
(n="freq" pctn="pct") /
/* Include missings and display them as 0s */
printmiss misstext="0";
run;