I have a data set with a number of records (the contents of which are irrelevant) and a series of flag variables at the end of each record. Something like this:
Record ID   Flag 1   Flag 2   Flag 3
    1         Y
    2         Y        Y
    3
    4
    5
    6                  Y        Y
I would like to create a printed report (or, ideally, a data set that I could then print) that would look something like the following:
Variable   N   % Not Missing
Flag 1     6   33.33333
Flag 2     6   33.33333
Flag 3     6   16.66666
I can accomplish something close to what I want for one variable at a time using Proc Freq with something like this:
proc freq data=Work.Records noprint;
tables Flag1 /out=Work.Temp;
run;
I suppose I could easily write a macro to loop through each variable and concatenate the results into one data set... but that seems WAY too complex for this. There has to be a built-in SAS procedure for this, but I'm not finding it.
Any thoughts out there?
Thanks!
Welcome to the fabulous world of Proc Tabulate.
PROC TABULATE is the procedure to use when PROC FREQ can't cut it. It works similarly to PROC FREQ in that it has TABLES statements, but beyond that it's pretty much PROC FREQ on steroids.
data test;
input RecordID (Flag1 Flag2 Flag3) ($);
datalines;
1 Y . .
2 . Y Y
3 . . .
4 . . .
5 . . .
6 . Y Y
;;;;
run;
proc tabulate data=test;
class flag1-flag3/missing;
tables (flag1-flag3),colpctn; *things that generate rows,things that generate columns;
run;
Unfortunately it's not easy to hide the blank rows without using some advanced style tricks, but if you don't mind seeing both rows this works pretty well.
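If you specifically want a printable data set like the one in the question (Variable, N, % not missing) rather than a report, here's one possible sketch building on the test data above. The nFlag names are mine, and STACKODSOUTPUT requires SAS 9.3 or later:

```sas
data flags_numeric;
    set test;
    array f{*} $ Flag1-Flag3;
    array n{*} nFlag1-nFlag3;
    do _i = 1 to dim(f);
        n{_i} = not missing(f{_i}); /* 1 = not missing, 0 = missing */
    end;
    drop _i;
run;

proc means data=flags_numeric n mean stackodsoutput;
    var nFlag1-nFlag3;
    ods output summary=flag_summary; /* one row per variable: N and proportion not missing */
run;
```

Multiplying the Mean column of flag_summary by 100 gives the percent-not-missing figure from your desired output.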
Both datasets have the same number of columns and the same column names. I stored most of the column names in a macro variable, sorted_cols, separated by spaces. One additional column, ID, exists in both SAS data sets. Some of the values are null (.) as well. To convert the datasets to matrix m1 I tried this:
proc iml
varNames = {&sorted_col};
use Sashelp.Class(OBS=31);
read all var varNames into m1;
close Sashelp.Class;
print m[colname=VarNames];
quit;
Obviously it is throwing errors.
I think I need to reference my dataset somewhere, but I don't know where to add it. The next step, I thought, would be as easy as (m1/m2)*1000, where m1 and m2 are the two matrices, storing the result in a new matrix. Then maybe write it back to a dataset. I am pretty much a novice in SAS.
This is my objective: one is dataset1 and two is dataset2. I want to divide element 1 of dataset2 (two) by element 1 of dataset1 (one), then compute the cumulative addition of those results down each column.
This isn't particularly difficult to do in base SAS. Of course, in IML it's even easier, but if you're not familiar with IML there's no particular reason to go into it unless you want to learn it.
This is one way; there are some interesting options using weighting also, but I find this nice and simple and easy to read and understand. First I initialize one and two, then do the work. Edited to reflect your data, with some code changes, particularly to add the running-total concept. The PROC MEANS at the bottom is the alternative to the running total - it would only work if you removed the running total (the two lines relating to tempsums), since with those in place it would sum the running totals, which is not correct.
data one;
input (d_201409-d_201412) (:best.);
row_label = _n_-1;
datalines;
3768 7079 6933 8451
3768 7079 6933 8448
3768 7079 6933 8447.4
3768 7079 6933 8447
3768 7079 6933 8447
3768 7079 6933 8445.86667
;;;;
run;
data two;
input (d_201409-d_201412) (:best.);
row_label = _n_-1;
datalines;
. 1 . 1
. 4 . 1
. . . .
. . 1 .
1 . . 1
. . 1 1
;;;;
run;
%let vars=d_201409-d_201412;
data one_div_two;
set one; *dividend first, then divisor;
array vars &vars.;
array tempstore[1000] _temporary_; *1000 is arbitrary; use something at least as large as the number of variables in &vars;
array tempsums[1000] _temporary_; *same idea;
do _i = 1 to dim(vars); *store dividend in temporary array;
tempstore[_i] = vars[_i];
end;
set two;
do _i = 1 to dim(vars); *now we divide original number by stored dividend;
vars[_i] = 1000 * divide(vars[_i],tempstore[_i]);
tempsums[_i] = sum(tempsums[_i],vars[_i]); *store away the running total;
vars[_i] = tempsums[_i]; *and retrieve it;
end;
run;
proc means data=one_div_two noprint; *and we summarize. Could do this in the above step too, but easier here.;
var &vars.;
output out=id_Summ sum=;
run;
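Since IML would indeed be easier, here is a rough IML sketch of the same calculation for comparison. It assumes a SAS/IML license; the handling of missing cells (zeroing them before cumulating) is my assumption about the desired behavior:

```sas
proc iml;
varNames = "d_201409":"d_201412";      /* expands the numeric-suffix range */
use one;  read all var varNames into m1;  close one;
use two;  read all var varNames into m2;  close two;

r = 1000 * (m2 / m1);        /* elementwise division; missings propagate */
r = choose(r = ., 0, r);     /* zero out missings before cumulating */

do j = 1 to ncol(r);         /* running total down each column */
    r[, j] = cusum(r[, j]);
end;

create one_div_two_iml from r[colname=varNames];
append from r;
close one_div_two_iml;
quit;
```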
I am trying to find a quick way to replace missing values with the average of the two nearest non-missing values. Example:
Id Amount
1 10
2 .
3 20
4 30
5 .
6 .
7 40
Desired output
Id Amount
1 10
2 **15**
3 20
4 30
5 **35**
6 **35**
7 40
Any suggestions? I tried using RETAIN, but I can only figure out how to retain the last non-missing value.
I think what you are looking for might be more like interpolation. While this is not the mean of the two closest values, it might be useful.
There is a nifty little tool for interpolating in datasets called PROC EXPAND. (It should do extrapolation as well, but I haven't tried that yet.) It's very handy when making series of dates and cumulative calculations.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc expand data=have out=Expanded;
convert amount=amount_expanded / method=join;
id id; /* the ID statement names the sequencing column */
run;
For more on the proc expand see documentation: https://support.sas.com/documentation/onlinedoc/ets/132/expand.pdf
This works:
data have;
input id amount;
cards;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;
run;
proc sort data=have out=reversed;
by descending id;
run;
data retain_non_missing;
set reversed;
retain next_non_missing;
if amount ne . then next_non_missing = amount;
run;
proc sort data=retain_non_missing out=ordered;
by id;
run;
data final;
set ordered;
retain last_non_missing;
if amount ne . then last_non_missing = amount;
if amount = . then amount = (last_non_missing + next_non_missing) / 2;
run;
but as ever, will need extra error checking etc for production use.
The key idea is to sort the data into reverse order, which lets RETAIN carry the next_non_missing value back up the data set. When sorted back into the correct order, you then have enough information to interpolate the missing values.
There may well be a PROC to do this in a more controlled way (I don't know anything about PROC STANDARDIZE, mentioned in Reeza's comment) but this works as a data step solution.
Here's an alternative requiring no sorting. It does require IDs to be sequential, though that can be worked around if they're not.
It uses two SET statements: one that reads the main (and previous) amounts, and one that reads ahead until the next non-missing amount is found. Here I use the sequence of ID values to guarantee it will be the right record, but you could write this differently (keeping track of which loop you're on) if the ID values aren't sequential or in any particular order.
I use the first.amount check to make sure we don't execute the second SET statement more often than we should (which would terminate the step early).
You need to do two things differently if you want first/last rows treated differently. Here I assume prev_amount is 0 if it's the first row, and I assume last_amount is missing, meaning the last row just gets the last prev_amount repeated, while the first row is averaged between 0 and the next_amount. You can treat either one differently if you choose, I don't know your data.
data have;
input Id Amount;
datalines;
1 10
2 .
3 20
4 30
5 .
6 .
7 40
;;;;
run;
data want;
set have;
by amount notsorted; *so we can tell if we have consecutive missings;
retain prev_amount; *next_amount is auto-retained;
if not missing(amount) then prev_amount=amount;
else if _n_=1 then prev_amount=0; *or whatever you want to treat the first row as;
else if first.amount then do;
do until ((next_id > id and not missing(next_amount)) or (eof));
set have(rename=(id=next_id amount=next_amount)) end=eof;
end;
amount = mean(prev_amount,next_amount);
end;
else amount = mean(prev_amount,next_amount);
run;
original output

            Count
            AAB   BB
01NOV2014     5    4
02NOV2014     4    3

But the ideal output is

            Count
            BB   AAB
01NOV2014    4     5
02NOV2014    4     4
Is there a way to change an n-by-k table from PROC TABULATE to list the columns in the requested order?
Since k is not small, I'm looking for an efficient way to achieve this. Maybe store the requested order in a macro variable?
The easiest answer depends on how the order is derived.
You have some ordering options on the class variable, such as order=data, which may give you the desired result if the data is stored in that order. This can be tricky, but sometimes is a simple method to get to that result.
Second, you have a couple of options related to formats.
If the data can be stored as a formatted numeric, where BB=1, AAB=2, etc., then use order=unformatted to achieve that.
Create a format that lists the values in order, just formatting them to themselves, with notsorted in the options of the value statement, and then use order=data on the class statement and preloadfmt.
Example of the second option:
data have;
input var $ count;
datalines;
AAA 1
AAB 2
BBA 3
BBB 4
;;;;
run;
proc format;
value $myformatf (notsorted)
'BBB'='BBB'
'AAB'='AAB'
'BBA'='BBA'
'AAA'='AAA'
other=' ';
run;
proc tabulate data=have;
class var/order=data preloadfmt;
format var $myformatf.;
var count;
tables var,count*sum;
run;
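For the first option above (storing the values as coded numerics and using order=unformatted), a sketch along these lines might work; the format and variable names here are made up:

```sas
proc format;
    value ordfmt 1='BB' 2='AAB';
run;

data have_coded;
    input varnum count;
    datalines;
2 5
1 4
;
run;

proc tabulate data=have_coded;
    class varnum / order=unformatted; /* sorts by the underlying codes 1, 2, ... */
    var count;
    tables varnum, count*sum;
    format varnum ordfmt.;
run;
```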
Another question. I have multiple data sets that generate output. How can I output these into one Excel worksheet and apply my own formatting? For example, I have data set 1, data set 2, data set 3; each data set has two columns, for example:

Col 1   Col 2
1       2
3       4
5       6

I want each data set to be in the one worksheet, separated by a blank column, so in Excel it should look like:

Col 1   Col 2   Blank Col   Col 1   Col 2   Blank Col

Someone told me I need to look at DDE for this. Is this true?
Regards,
You can definitely do it using DDE. What DDE does is simulate a user's actions in Excel: clicks on menus, buttons, cells, etc. Here's an example of how you can do that with a macro loop for 3 datasets named have1, have2, and have3. If you need a more general solution (unknown number of datasets, varying numbers of variables, arbitrary dataset names, etc.), the code would need updating, but its DDE part will be essentially the same.
One more assumption: your Excel workbook should be open during code execution. Though this too can be automated - Excel can be started and the file opened using DDE itself.
You can find a very nice introduction to DDE here, where all these tricks are discussed in detail.
data have1;
input Col1 Col2;
datalines;
1 2
3 4
5 6
;
run;
data have2;
input Col1 Col2;
datalines;
1 2
3 4
5 6
7 8
;
run;
data have3;
input Col1 Col2;
datalines;
1 2
3 4
7 8
5 6
9 10
;
run;
%macro xlsout;
/*iterating through your datasets*/
%do i=1 %to 3;
/*determine number of records in the current dataset*/
proc sql noprint;
select count(*) into :noobs
from have&i;
quit;
/*assign a range on the workbook spreadsheet matching to data in the current dataset*/
filename range dde "excel|[myworkbook.xls]sas!r1c%eval((&i-1)*3+1):r%left(&noobs)c%eval((&i-1)*3+2)" notab;
/*put data into selected range*/
data _null_;
set have&i;
file range;
put Col1 '09'x Col2;
run;
%end;
%mend xlsout;
%xlsout
You cannot do exactly this with SAS (DDE is probably possible). I would suggest looking at SaviCells Pro.
http://www.sascommunity.org/wiki/SaviCells
http://www.savian.net/utilities.html
You could likely accomplish what you're asking through ODS TAGSETS.EXCELXP or the new ODS EXCEL (9.4 TS1M1). You would need to arrange the datasets ahead of time (i.e., merge them together or transpose or whatnot to get one dataset with the right columns), however, or else use PROC REPORT or some other procedure to get them into the right format.
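A rough sketch of that ODS EXCEL route, using the have1-have3 datasets from the DDE answer. The file path, sheet name, and blank-column trick are assumptions on my part:

```sas
data combined;
    /* one-to-one merge by observation number; blanks act as separator columns */
    merge have1(rename=(Col1=a_Col1 Col2=a_Col2))
          have2(rename=(Col1=b_Col1 Col2=b_Col2))
          have3(rename=(Col1=c_Col1 Col2=c_Col2));
    length blank1 blank2 $1;
    call missing(blank1, blank2);
run;

ods excel file="myworkbook.xlsx" options(sheet_name="sas");
proc print data=combined noobs;
    var a_Col1 a_Col2 blank1 b_Col1 b_Col2 blank2 c_Col1 c_Col2;
run;
ods excel close;
```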
I have data that looks like this:
id t x
1 1 3.7
1 3 1.2
1 4 2.4
2 2 6.0
2 4 6.1
2 5 6.2
For each id I want to add observations as necessary so there are values for all 1<=t<=5.
So my desired result is:
id t x
1 1 3.7
1 2 .
1 3 1.2
1 4 2.4
1 5 .
2 1 .
2 2 6.0
2 3 .
2 4 6.1
2 5 6.2
My real setting involves massive amounts of data, so I'm looking for the most efficient way to do this.
Here's probably the simplest way, using the COMPLETETYPES option in PROC SUMMARY. I'm making the assumption that the combinations of id and t are unique in the data.
The only thing I'm not sure of is whether you'll run into memory issues when running against a very large dataset; I have had problems with PROC SUMMARY in this respect in the past.
data have;
input id t x;
cards;
1 1 3.7
1 3 1.2
1 4 2.4
2 2 6.0
2 4 6.1
2 5 6.2
;
run;
proc summary data=have nway completetypes;
class id t;
var x;
output out=want (drop=_:) max=;
run;
One option is to use PROC EXPAND, if you have ETS. I'm not sure if it'll do 100% of what you want, but it might be a good start. It seems like so far the main problem is it won't do records at the start or the end, but I think that's surmountable; just not sure how.
proc expand data=have out=want from=daily method=none extrapolate;
by id;
id t;
run;
That fills in 2 for id 1 and 3 for id 2, but does not fill in 5 for id 1 or 1 for id 2.
To do it in base SAS, you have a few options. PROC FREQ with the SPARSE option might be a good option.
proc freq data=have noprint;
tables id*t/sparse out=want2(keep=id t);
run;
data want_fin;
merge have want2;
by id t;
run;
You could also do this via PROC SQL, with a join to a table with the possible t values, but that seems slower to me (even though the FREQ method requires two passes, FREQ will be pretty fast and the merge is using already sorted data so that's also not too slow).
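For reference, the PROC SQL version alluded to above might look like this sketch. It assumes every value of t appears somewhere in the data; otherwise you would cross join against a separately built table of all t values:

```sas
proc sql;
    create table want_sql as
    select ids.id, vals.t, h.x
    from (select distinct id from have) as ids
         cross join (select distinct t from have) as vals
         left join have as h
           on h.id = ids.id and h.t = vals.t
    order by ids.id, vals.t;
quit;
```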
Here's another approach, provided that you already know the minimum/maximum values for T. It creates a template that contains all values of ID and T, then merges with the original data set so that you keep the values of X.
proc sort data=original_dataset out=template(keep=id) nodupkey;
by id;
run;
data template;
set template;
do t = 1 to 5; /* you could make these macro variables */
output;
end;
run;
proc sort data=original_dataset;
by id t;
run;
data complete_dataset;
merge template(in=in_template) original_dataset(in=in_original);
by id t;
if in_template then output;
run;