Hi,
Can someone explain to me, step by step, what the following code does?
I need to describe in detail what happens at each step.
%macro frequency_encoding(dataset, var);
proc sql noprint;
create table freq as
select distinct(&var) as values, count(&var) as number
from &dataset
group by Values ;
create table new as select *, round(freq.number/count(&var),00.01) As freq_encode
from &dataset left join freq on &var=freq.values;
quit;
data new(drop=values number &var);
set new;
rename freq_encode=&var;
run;
data new;
set new;
keep &var;
run;
data dane(drop = &var);
set dane;
run;
data dane;
set dane;
set new;
run;
The SQL is first finding the frequency of each value of the variable. Then it divides those counts by the total number of non-missing values and rounds that percentage to two decimal places (or integers when you think of the ratio as a percentage).
This could be done in one step with:
proc sql noprint;
create table new as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,&var as values,count(&var) as number
from &dataset
group by &var
)
;
quit;
It is not clear what the DANE dataset is supposed to be. If &DATASET does not equal DANE then those last four data steps make no sense. If it does then it is a convoluted way to replace the original variable with the percentage.
The first one is basically trying to rename the calculated percentage as the original variable and eliminate the original variable and the other two intermediate variables used in calculating the percentage.
The second one is dropping all of the variables except the new percentage.
The third one is dropping the original variable from "dane".
The last one is adding the new variable back to "dane".
Assuming DANE should be replaced with &DATASET then those four data steps could be reduced to one:
data &dataset;
set &dataset(drop=&var);
set new(keep=freq_encode rename=(freq_encode=&var));
run;
It is probably best not to overwrite your original dataset in that way. So perhaps you should add an OUT parameter to your macro to name the new dataset you want to create.
You could have avoided all of those data steps by just adding the DROP= and RENAME= dataset options to the dataset generated by the SQL query.
So perhaps you want something like this:
%macro frequency_encoding(dataset, var,out);
proc sql noprint;
create table &out(drop=&var number rename=(freq_encode=&var)) as
select *,round(number/count(&var),0.01) as freq_encode
from (select *,count(&var) as number
from &dataset
group by &var
)
;
quit;
%mend ;
%frequency_encoding(sashelp.class,sex,work.class);
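As a quick sanity check of the call above (a sketch, assuming the macro ran as written and that SASHELP.CLASS is the standard 19-row sample table with 9 girls and 10 boys), the new SEX variable in WORK.CLASS should hold only two distinct values:

```sas
/* Tabulate the encoded variable produced by the macro call above. */
proc freq data=work.class;
  tables sex / nocum nopercent;
run;
```

With those counts the encoding works out to round(9/19, 0.01) = 0.47 and round(10/19, 0.01) = 0.53.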
Related
I have a do loop in which I calculate a new variable at each iteration. The result is stored as an additional column, and each of these columns should be attached to the output table defined by the macro.
Something similar has been asked here on SO, but the answers there do not solve it: the last one is very close but is not valid SAS, and leaves me with this incomplete script:
proc sql;
update &outlib..&out.
set var._iqr = b.&var._iqr
from &outlib..&out. as a
left join cal_resul as b
on a.id_client=b.id_client
and a.reference_date=b.reference_date;
quit;
Here is my attempt, which works but is very slow:
proc sql; create table &outlib..&out. as select * from &inlib..&in.; quit; /* the input is as a basis for output table */
proc sql; alter table &outlib..&out. add &var._iqr numeric; quit; /* create empty column to be filled at each iteration */
proc sql;
update &outlib..&out. as a
set &var._iqr=(select b.&var._iqr from cal_resul as b
where a.id_client=b.id_client
and a.reference_date=b.reference_date
and a.data_source=b.data_source);
quit;
Attempt 2:
This is somewhat faster:
proc sort data=cal_resul; by id_client reference_date data_source; run;
data &outlib..&out.;
update &outlib..&out. cal_resul;
by id_client reference_date data_source;
run;
A simple left join would be much faster (it just adds a new column to an existing table), but I have not figured out how to use a left join to update &outlib..&out. in place, so that the same dataset is retained at each iteration. Many thanks for any help.
If you want to ADD a variable to a dataset you will have to make a new dataset. (Your ALTER TABLE statement will create a new dataset and copy over all of the observations.)
Looks like your data has three key variables. So use those in merging the new data to the old.
For example to make a new variable in HAVE named EXAMPLE_IQR using the variable EXAMPLE in the dataset NEW you could use code like this. I have used macro variables to show how you might use those macro variables as the parameters to a macro. It sounds like you don't want the process to add new observations to the existing dataset so I have added a check for that using the IN= dataset option.
%let base=work.have;
%let indata=work.new;
%let var=example;
data &base ;
merge &base(in=inbase)
&indata(keep=id_client reference_date data_source &var
rename=(&var=&var._iqr)
)
;
by id_client reference_date data_source;
if inbase;
run;
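The merge above relies on both datasets being in order of the three key variables; otherwise the step stops with an error saying the BY variables are not properly sorted. A minimal sketch of the sorts to run beforehand, reusing the same macro variables (note that it sorts both datasets in place):

```sas
/* MERGE with a BY statement expects both inputs in BY-variable order. */
proc sort data=&base;
  by id_client reference_date data_source;
run;

proc sort data=&indata;
  by id_client reference_date data_source;
run;
```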
I want to count the number of records in a dataset in SAS. Is there a function that does this in a simple way? In R I used the length() function to get this information. Moreover, I need the number of records to compute some percentages, so I need the value not in a table but in a form that can be used in other data steps. How can I do this?
Thanks in advance
Here is another solution, using SAS dictionaries,
proc sql;
select nobs into: num_obs
from dictionary.tables
where libname = "WORK" and memname = "A"
;
quit;
It is easy to get the size of many datasets by modifying the above code,
proc sql;
create table test as
select memname, nobs
from dictionary.tables
where libname = "WORK" and memname like "A%"
;
quit;
data _null_;
set test;
call symputx(memname, nobs); /* SYMPUTX converts the numeric NOBS cleanly and strips blanks */
run;
The above code will give you the sizes of all data sets with name starting with "a" in the temporary/work library.
Assuming this is a basic SAS table that you've created, and not modified or appended to, the best way is to use the metadata held in the dataset (the number of observations is held in a piece of metadata called "nobs") and place it in a macro variable, without reading through the dataset itself. You can do this in the following way:
Data _null_;
i=1;
If i = 0 then set DATASETTOCOUNT nobs= mycount;
Call symput('mycount', mycount);
Run;
%put &mycount.;
You will now have a macro variable that contains the number of rows in your dataset, that you can call on in other data steps using &mycount.
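For instance, a later data step could use the macro variable to turn counts into percentages. This is only a sketch: WITH_SHARE and PCT_OF_TOTAL are illustrative names, and it assumes the dataset is not empty, since otherwise the division would be by zero.

```sas
/* Hypothetical follow-up step: WITH_SHARE and PCT_OF_TOTAL are illustrative names. */
data with_share;
  set DATASETTOCOUNT;
  pct_of_total = 100 / &mycount.;  /* each row's share of all rows, in percent */
run;
```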
I have a file with 10 obs. and different parameters. I need to add to my data a new variable of 'ID' for each observation- i.e a column of numbers 1-10.
How can I add a variable that is simply equal to the obs column?
I thought about doing it with a loop: define an empty variable, run over it, and each time add 1 to the previous observation. However, that seems rather complicated. Is there a better way to do it?
You can use the Data Step automatic variable _n_. This is the iteration count of the Data Step loop.
Data want;
set have;
ID = _n_;
run;
If you opt for a Proc SQL solution, there are two ways:
1. Undocumented:
proc sql;
create table want as
select monotonic() as row, *
from sashelp.class
;
quit;
2. Documented:
ods listing close;
ods output sql_results=want;
proc sql number;
select * from sashelp.class;
quit;
ods listing;
@DomPazz's answer would definitely work! In case you would like to number the observations within groups defined by one or more attributes, try this:
proc sort data=dataset out=sort_data;
by attribute; /* your BY variable(s) */
run;
data sort_data;
set sort_data;
by attribute; /* the same variable(s) as in the PROC SORT above */
if first.attribute then i = 0; /* reset the counter on the first observation of each BY group */
i + 1; /* sum statement: i is retained across observations */
/* if last.attribute and ... then ...; optional: whatever you want to do on the last row of a group */
run;
FIRST. and LAST. are very helpful for row-by-row operations within groups.
I have a table with postings by category (a number) that I transposed. I got a table with each column name as _number for example _16, _881, _853 etc. (they aren't in order).
I need to sum all of them in a proc sql, but I don't want to create the variable in a data step, and I don't want to write out all of the column names either. I tried this but it doesn't work:
proc sql;
select sum(_815-_16) as nnl
from craw.xxxx;
quit;
I tried going from the first number to the last, and also from the column in the first position to the column in the last position. Either way it gives me a number that is not correct.
Any ideas?
Thanks!
You can't use variable lists in SQL, so _: and var1-var6 and var1--var8 don't work.
The easiest way to do this is a data step view.
proc sort data=sashelp.class out=class;
by sex;
run;
*Make transposed dataset with similar looking names;
proc transpose data=class out=transposed;
by sex;
id height;
var height;
run;
*Make view;
data transpose_forsql/view=transpose_forsql;
set transposed;
sumvar = sum(of _:); *I confirmed this does not include _N_ for some reason - not sure why!;
run;
proc sql;
select sum(sumvar) from transpose_Forsql;
quit;
I have no documentation to support this but from my experience, I believe SAS will assume that any sum() statement in SQL is the sql-aggregate statement, unless it has reason to believe otherwise.
The only way I can see for SAS to differentiate between the two is by the way arguments are passed into it. In the below example you can see that the internal sum() function has 3 arguments being passed in so SAS will treat this as the SAS sum() function (as the sql-aggregate statement only allows for a single argument). The result of the SAS function is then passed in as the single parameter to the sql-aggregate sum function:
proc sql noprint;
create table test as
select sex,
sum(sum(height,weight,0)) as sum_height_and_weight
from sashelp.class
group by 1
;
quit;
Result:
proc print data=test;
run;
sum_height_
Obs Sex and_weight
1 F 1356.3
2 M 1728.6
Also note a trick I've used in the code by passing in 0 to the SAS function - this is an easy way to add an additional parameter without changing the intended result. Depending on your data, you may want to swap out the 0 for a null value (ie. .).
EDIT: To address the issue of unknown column names, you can create a macro variable that contains the list of column names you want to sum together:
proc sql noprint;
select name into :varlist separated by ','
from sashelp.vcolumn
where libname='SASHELP'
and memname='CLASS'
and upcase(name) like '%T' /* MATCHES HEIGHT AND WEIGHT */
;
quit;
%put &varlist;
Result:
Height,Weight
Note that you would need to change the above wildcard to match your scenario - ie. matching fields that begin with an underscore, instead of fields that end with the letter T. So your final SQL statement will look something like this:
proc sql noprint;
create table test as
select sex,
sum(sum(&varlist,0)) as sum_of_fields_ending_with_t
from sashelp.class
group by 1
;
quit;
This provides an alternate approach to Joe's answer - though I believe using the view as he suggests is a cleaner way to go.
I've been trying to make my code more efficient and this is the original code, but I think it can be written in one step.
data TABLE;set ORIGINAL_DATA;
Multi=percent*total_units;
keep Multi Type;
proc sort; by Type;
proc means noprint data=TABLE; by Type; var Multi;output out=Table2(drop= _type_ _freq_)sum=Multi;run;
proc means noprint data=TABLE; var Multi;output out=Table3(drop= _type_ _freq_) sum=total ;run;
proc sql;
create table TABLE4 as
select a.Type, a.Multi label="Multi", b.total label="total"
from TABLE2 a, TABLE3 b
order by Type;
quit;
data TABLE5;set TABLE4;
pct=(MULTI/total)*100;
run;
I am able to split up part of it, but I can't figure out how to get the PCT part in my code. This is what I have.
proc sql;
create table TABLE1 as
select distinct type, sum(percent*total_units) as MULTI label "MULTI",
MULTI/(percent*total_units)) as PCT
from ORIGINAL_DATA
group by type;
quit;
I had to edit some of the code but I think the general idea should make sense.
The main problem is I cannot call upon the MULTI column because it is just being created but I want to create a percentage of the total for each type.
The "SAS" way to do something like this is to use a CLASS statement with PROC MEANS. That will calculate statistics on all the interaction levels in the data (identified by the _TYPE_ variable). The row where _TYPE_=0 will be the "total" value, representing the value of that statistic for the entire data set.
In your case, we can take advantage of the fact that PROC MEANS will create the output data set sorted by _TYPE_ and by the variables listed in the CLASS statement. That means we can just read the first observation and save its value for calculating percentages.
It's probably easier to just show some code:
data TABLE;
set ORIGINAL_DATA;
Multi = percent * total_units;
keep Multi Type;
run;
proc means noprint data=TABLE;
class Type;
var multi;
output out=next sum=;
run;
data want;
retain total;
set next;
if _n_ = 1 then do;
/* The first obs will be the _TYPE_=0 record */
total = multi;
delete;
end;
pct = (multi / total) * 100;
drop total _freq_ _type_;
run;
Notice that you do not need to sort the data before using PROC MEANS. That's because we are using a CLASS statement rather than a BY statement. The data step is using the first observation in the data set created by MEANS (the _TYPE_=0 record) to retain the total sum of your variable. The delete statement keeps it out of the result.
CLASS statements with PROC MEANS are very useful. Take a few minutes to read up on how the _TYPE_ variable is calculated, especially if you try using more than one class variable.
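To see how _TYPE_ behaves with more than one class variable, here is a small sketch using SASHELP.CLASS; the choice of table and the output name STATS are only for illustration.

```sas
/* Two class variables: SEX is listed first, AGE second. */
proc means noprint data=sashelp.class;
  class sex age;
  var height;
  output out=stats sum=;
run;

/* _TYPE_ = 0: overall total
   _TYPE_ = 1: by AGE only (the last CLASS variable)
   _TYPE_ = 2: by SEX only
   _TYPE_ = 3: by SEX and AGE together */
proc print data=stats;
  var _type_ sex age _freq_ height;
run;
```

With several class variables the "first observation is the total" trick still holds, but you would typically filter on _TYPE_ to pick out the level of summarisation you need.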
You can skip the initial data step by using the WEIGHT option in VAR statement of PROC MEANS (this will effectively do the multiplication for you). You can also use PROC TABULATE instead of PROC MEANS, as tabulate can calculate the percentage. I believe the following code will produce your required output in one go.
ods noresults;
proc tabulate data=have out=want (drop=_: rename=(total_units_sum=total total_units_pctsum_0=pct));
class type;
var total_units / weight=percent;
table type, total_units*(sum pctsum);
run;
ods results;
If you need one step, maybe this will work, but it's not actually efficient, since it processes data twice, once for detail by TYPE, once for total.
proc sql;
create table TABLE1 as
select
d.type
, sum(d.percent*d.total_units) as MULTI label "MULTI"
, calculated MULTI/s.total as PCT
from ORIGINAL_DATA d,
( select sum(percent*total_units) as total
from ORIGINAL_DATA) s
group by d.type, s.total /* s.total is included so PROC SQL does not remerge and return one row per input row */
;
quit;
For more efficiency, but in more than one step, you could simply replace tables with views in your original code:
data TABLE; => data TABLE / view=TABLE;
create table TABLE4 => create view TABLE4