I have a sas data set with 564 variables. I need to create a new table with three columns, column1 will be the variables name , column2 will be the value of that variable, and column3 will be observation number.
So if I have a variable gender then gender will be listed 2 times in the variable column, and the gender values will be listed as m in the first row and female in the second row, and the column3 will be just the number of the observation. This is how it should look. Many thanks in advance.
var value obs
gender m 1
gender f 2
ans yes 3
ans no 4
The key to getting a table out of PROC FREQ that has all the values in one column is ods output combined with coalescec.
ODS OUTPUT lets you tell PROC FREQ to put everything into one dataset (as opposed to out= which just puts one Freq table into one dataset). That gives you a slightly messy result, which we then use coalescec to fix. That function takes a list of variables and returns the first nonmissing value from them; since the F_ variables always have just one value populated (the formatted value of the variable in the table), it's easy to use them.
ods output onewayfreqs=freqs;
proc freq data=sashelp.class;
tables age sex;
run;
ods output close; *technically unneeded but makes it more clear;
data want;
set freqs;
value = left(coalescec(of f_:));
run;
The rest of what you note above is trivial from that dataset.
Related
Hi,
Can someone explain to me what a given code sequence does step by step?**
I must describe it in detail what is happening in turn
%macro frequency_encoding(dataset, var);
proc sql noprint;
create table freq as
select distinct(&var) as values, count(&var) as number
from &dataset
group by Values ;
create table new as select *, round(freq.number/count(&var),00.01) As freq_encode
from &dataset left join freq on &var=freq.values;
quit;
data new(drop=values number &var);
set new;
rename freq_encode=&var;
run;
data new;
set new;
keep &var;
run;
data dane(drop = &var);
set dane;
run;
data dane;
set dane;
set new;
run;
The SQL is first finding the frequency of each value of the variable. Then it divides those counts by the total number of non-missing values and rounds that percentage to two decimal places (or integers when you think of the ratio as a percentage).
This could be done in one step with:
proc sql noprint;
create table new as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,&var as values,count(&var) as number
from &dataset
group by &var
)
;
quit;
It is not clear what the DANE dataset is supposed to be. If &DATESET does not equal DANE then those last four data steps make no sense. If it does then it is a convoluted way to replace the original variable with the percentage.
The first one is basically trying to rename the calculated percentage as the original variable and eliminate the original variable and the other two intermediate variables used in calculating the percentage.
The second one is dropping all of the variables except the new percentage.
The third one is dropping the original variable from "dane".
The last one is adding the new variable back to "dane".
Assuming DANE should be replaced with &DATASET then those four data steps could be reduced to one:
data &dataset;
set &dataset(drop=&var);
set new(keep=freq_encode rename=(freq_encode=&var));
run;
It is probably best not to overwrite your original dataset in that way. So perhaps you should add an OUT parameter to your macro to name to new dataset you want to create.
You could have avoided all of those data steps by just adding the DROP= and RENAME= dataset options to the dataset generated by the SQL query.
So perhaps you want something like this:
%macro frequency_encoding(dataset, var,out);
proc sql noprint;
create table &out(drop=&var number rename=(freq_encode=&var)) as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,count(&var) as number
from &dataset
group by &var
)
;
quit;
%mend ;
%frequency_encoding(sashelp.class,sex,work.class);
I'm doing a simple count of occurrences of a by-variable within a class variable, but cannot find a way to rename the total count across class variables. At the moment, the output dataset includes counts for all cluster2 within each group as well as the total count across all groups (i.e. the class variable used). However, the counts within classes are named, while the total is shown by an empty string.
Code:
proc means data=seeds noprint;
class group;
by cluster2;
id label2;
output out=seeds_counts (drop= _type_ _freq_) n(id)=count;
run;
Example of output file:
cluster2 group label2 count
7 area 1 20
7 sa area 1 15
7 sb area 1 5
15 area 15 42
15 sa area 15 18
....
Naturally, renaming the emtpy string to "Total" could be accomplished in a separate datastep, but I would like to do it directly in the Proc Means-step. It should be simple and trivial, but I haven't found a way so far. Afterwards, I want to transpose the dataset, which means that the emtpy string has to be changed, or it will be dropped in the proc transpose.
I don't know of a way to do it directly, but you can sort-of-cheat: you can tell SAS to show "Total" instead of missing.
proc format;
value $MissTotalF
' ' = 'Total'
other = [$CHAR12.];
quit;
proc means data=sashelp.class noprint;
class sex;
id age;
output out=sex_counts (drop= _type_ _freq_) n(age)=count;
format sex $MissTotalF.;
run;
For example. I'd also recommend using PROC TABULATE instead of PROC MEANS if you're just going for counts, though in this case it doesn't really make much difference.
The problem here is that if the variable in the class statement is numeric, then the resultant column will be numeric, therefore you can't add the word Total (unless you use a format, similar to the answer from #Joe). This will be why the value is missing, as the class variable can be either numeric or character.
Here's an example of a numeric class variable.
proc sort data=sashelp.class out=class;
by sex;
run;
proc means data=class noprint;
class age;
by sex;
output out=class_counts (drop= _:) n=count;
run;
Using proc tabulate can display the result pretty much how you want it, however the output dataset will have the same missing values, so won't really help. Here's a couple of examples.
proc tabulate data=class out=class_tabulate1 (drop=_:);
class sex age;
table sex*(age all='Total'),n='';
run;
proc tabulate data=class out=class_tabulate2 (drop=_:);
class sex age;
table sex,age*n='' all='Total';
run;
I think the best option to achieve your final goal is to add the nway option to proc means, which will remove the subtotals, then transpose the data and finally write a data step that creates the Total column by summing each row. It's 3 steps, but doesn't involve much coding.
Here is one method you could use by taking advantage of the _TYPE_ variable so that you can process the totals and details separately. You will still have trouble with PROC TRANSPOSE if there is a class with missing values (separate from the overall summary record).
proc means data=sashelp.class noprint;
class sex;
id age;
output out=sex_counts (drop= _freq_ ) n(age)=count;
run;
proc transpose data=sex_counts out=transpose prefix=count_ ;
where _type_=1 ;
id sex ;
var count;
run;
data transpose ;
merge transpose sex_counts(where=(_type_=0) keep=_type_ count);
rename count=count_Total;
drop _type_;
run;
I have the following problem. I need to run PROC FREQ on multiple variables, but I want the output to all be on the same table. Currently, a PROC FREQ statement with something like TABLES ERstatus Age Race, InsuranceStatus; will calculate frequencies for each variable and print them all on separate tables. I just want the data on ONE table.
Any help would be appreciated. Thanks!
P.S. I tried using PROC TABULATE, but it didn't not calculate N correctly, so I'm not sure what I did wrong. Here is my code for PROC TABULATE. My variables are all categorical, so I just need to know N and percentages.
PROC TABULATE DATA = BCanalysis;
CLASS ERstatus PRstatus Race TumorStage InsuranceStatus;
TABLE (ERstatus PRstatus Race TumorStage) * (N COLPCTN), InsuranceStatus;
RUN;
The above code does not return the correct frequencies based on InsuranceStatus where 0 = insured and 1 = uninsured, but PROC FREQ does. Also doesn't calculate correctly with ROWPCTN. So any way that I can get PROC FREQ to calculate multiple variables on one table, or PROC TABULATE to return the correct frequencies, would be appreciated.
Here is a nice image of my output in a simplified analysis of only ERstatus and InsuranceStatus. You can see that PROC FREQ returns 204 people with an ERstatus of 1 and InsuranceStatus of 1. That's correct. The values in PROC TABULATE are not.
OUTPUT
I'll answer this separately as this is answering the other possible interpretation of the question; when it's clarified I'll delete one or the other.
If you want this in a single printed table, then you either need to use proc tabulate or you need to normalize your data - meaning put it in the form of variable | value. PROC FREQ is not capable of doing multiple one-way frequencies in a single table.
For PROC TABULATE, likely your issue is missing data. Any variable that is on the class statement will be checked for missingness, and if any rows are missing data for any of the class variables, those rows are entirely excluded from the tabulation for all variables.
You can override this by adding the missing option on the class statement, or in the table statement, or in the proc tabulate statement. So:
PROC TABULATE DATA = BCanalysis;
CLASS ERstatus PRstatus Race TumorStage InsuranceStatus/missing;
TABLE (ERstatus PRstatus Race TumorStage) * (N COLPCTN), InsuranceStatus;
RUN;
This will result in a slightly different appearance than on your table, though, as it will include the missing rows in places you probably do not want them, and they'll be factored against the colpctn when again you probably don't want them.
Typically some manipulation is then necessary; the easiest is to normalize your data and then run a tabulation (using PROC TABULATE or PROC FREQ, whichever is more appropriate; TABULATE has better percentaging options though) against that normalized dataset.
Let's say we have this:
data class;
set sashelp.class;
if _n_=5 then call missing(age);
if _n_=3 then call missing(sex);
run;
And we want these two tables in one table.
proc freq data=class;
tables age sex;
run;
If we do this:
proc tabulate data=class;
class age sex;
tables (age sex),(N colpctn);
run;
Then we get an N=17 total for both subtables - that's not what we want, we want N=18. Then we can do:
proc tabulate data=class;
class age sex/missing;
tables (age sex),(N colpctn);
run;
But that's not quite right either; I want F to have 8/18 = 44.44% and M 10/18 = 55.55%, not 42% and 53% with 5% allocated to the missing row.
The way I do this is to normalize the data. This means you get a dataset with 2 variables, varname and val, or whatever makes sense for your data, plus whatever identifier/demographic/whatnot variables you might have. val has to be character unless all of your values are numeric.
So for example here I normalize class with age and sex variables. I don't keep any identifiers, but you certainly could in your data, I imagine InsuranceStatus would be kept there if I understand what you're doing in that table. Once I have the normalized table, I just use those two variables, and carefully construct a denominator definition in proc tabulate to have the right basis for my pctn value. It's not quite the same as the single table before - the variable name is in its own column, not on top of the list of values - but honestly that looks better in my opinion.
data class_norm;
set class;
length val $2;
varname='age';
val=put(age,2. -l);
if not missing(age) then output;
varname='sex';
val=sex;
if not missing(sex) then output;
keep varname val;
run;
proc tabulate data=class_norm;
class varname val;
tables varname=' '*val=' ',n pctn<val>;
run;
If you want something better than this, you'll probably have to construct it in proc report. That gives you the most flexibility, but is the most onerous to program in also.
You can use ODS OUTPUT to get all of the PROC FREQ output to one dataset.
ods output onewayfreqs=class_freqs;
proc freq data=sashelp.class;
tables age sex;
run;
ods output close;
or
ods output crosstabfreqs=class_tabs;
proc freq data=sashelp.class;
tables sex*(height weight);
run;
ods output close;
Crosstabfreqs is the name of the cross-tab output, while one-way frequencies are onewayfreqs. You can use ods trace to find out the name if you forget it.
You may (probably will) still need to manipulate this dataset some to get the structure you want ultimately.
I have a dataset with many columns like this:
ID Indicator Name C1 C2 C3....C90
A 0001 Black 0 1 1.....0
B 0001 Blue 1 0 0.....1
B 0002 Blue 1 0 0.....1
Some of the IDs are duplicates because the indicator is different, but they're essentially the same record. To find duplicates, I want to select distinct ID, Name and then C1 through C90 to check because some claims who have the same Id and indicator have different C1...C90 values.
Is there a way to select c1...c90 either through proc sql or a sas data step? It seems the only way I can think of is to set the dataset and then drop the non essential columns, but in the actual dataset, it's not only Indicator but at least 15 other columns.
It would be nice if PROC SQL used the : variable name wildcard like other Procs do. When no other alternative is reasonable, I usually use a macro to select bulk columns. This might work for you:
%macro sel_C(n);
%do i=1 %to %eval(&n.-1);
C&i.,
%end;
C&n.
%mend sel_C;
proc sql;
select ID,
Indicator,
Name,
%sel_C(90)
from have_data;
quit;
If I understand the question properly, the easiest way would be to concatenate the columns to one. RETAIN that value from row to row, and you can compare it across rows to see if it's the same or not.
data want;
set have;
by id indicator;
retain last_cols;
length last_cols $500;
cols = catx('|',of c1-c90);
if first.id then call missing(last_cols);
else do;
identical = (cols = last_cols); *or whatever check you need to perform;
end;
output;
last_cols = cols;
run;
There are a few different ways you can do this and it will be much easier if the actual column names are C1 - C90. If you're just looking to remove anything that you know is a duplicate you can use proc sort.
proc sort data=dups out=nodups nodupkey;
by ID Name C1-C90;
run;
The nodupkey option will automatically remove any duplicates in the by statement.
Alternatively, if you want to know which records contain duplicates, you could use proc summary.
proc summary data=dups nway missing;
class ID Name C1-C90;
output out=onlydups(where=(_freq_ > 1));
run;
proc summary creates two new variables, _type_ and _freq_. If you specify _freq_ > 1 you will only output the duplicate records. Also, note that this will remove the Indicator variable.
I'd like to get a frequency table that lists all variables, but only tells me the number of times "-2", "-1" and "M" appear in each variable.
Currently, when I run the following code:
proc freq data=mydata;
tables _ALL_
/list missing;
I get one table for each variable and all of its values (sometimes 100s). Can I just get tables with the three values I want, and everything else suppressed?
You can do this a number of ways.
First off, you probably want to do this to a dataset first to allow you to filter that dataset. I would use PROC TABULATE, but you can use PROC FREQ if you like it better.
*make up some data;
data mydata;
call streaminit(132);
array x[100];
do _i = 1 to 50;
do _t = 1 to dim(x);
x[_t]= floor(rand('Uniform')*9-5);
end;
output;
end;
keep x:;
run;
ods _all_ close; *close the 'visible' output types;
ods output onewayfreqs=outdata; *output the onewayfreqs (one way frequency tables) to a dataset;
proc freq data=mydata;
tables _all_/missing;
run;
ods output close; *close the dataset;
ods preferences; *open back up your default outputs;
Then filter it, and once you've done that print it however you want. Note in the PROC FREQ output, you get a column for each different variable - not super helpful. The F_ variables are the formatted values, which can then be combined using coalesce. I assume here they're all numeric variables - define f_val as character and use coalescec if there are any character variables or variables with character-ish formats applied to them.
data has_values;
set outdata;
f_val = coalesce(of f_:);
keep table f_val frequency percent;
if f_val in (0,-1,-2);
run;
The last line keeps only the 0,-1,-2.