This is a follow-up question to this: Cleaner way of handling addition of summarizing rows to table?
However, this time we've got something slightly different I'm afraid. We have: a dataset with four independent variables, and two dependent variables. We want:
a) a set distinct by three independent variables, with a count of distinct variable #4 and sums of variables #5 & #6. This is easy.
b) TOTAL entries created for each combination of the three independent variables, with a count of distinct variable #4 and sums of variables #5 and #6. This is not easy (to me).
So the idea would be to modify this:
proc means data=have;
class ind1 ind2 ind3;
var dependent5 dependent6;
output out=want sum=;
run;
Such that it additionally could count the number of distinct values of variable #4, for each combination of variables #1,2, and 3, including ALL.
Ideas I've had:
1) Abandon hope all ye who enter here; go back to trying to do this in proc sql with a bunch of macro code, which allows you to use the useful count(distinct ...) thang.
2) Something with nlevels?
3) Using convoluted, terrible macro code to generate however many hashes would be necessary to handle the uniques.
4) ??
Creating sample data here is kind of tricky; let me know if this makes no sense and I'll do my best to come up with some.
--
Edit: to be clear, the reason SQL queries require macro code (and would be slow) is because it would need to be something like the following (but expanded to many more levels)
select
"ALL" as ind1,
ind2
...
group by ind1, ind2;
select
ind1,
"ALL" as ind2
...
group by ind1, ind2;
select
"ALL" as ind1,
"ALL" as ind2
...
group by ind1, ind2;
This will get very unwieldy as we add more and more independent variables.
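Did you try this?
proc sql;
/* ind4 is a placeholder name for variable #4; the question does not name it */
select ind1, ind2, ind3,
count(distinct ind4) as var_4, sum(dependent5) as var_5, sum(dependent6) as var_6
from have
group by ind1, ind2, ind3
order by ind1, ind2, ind3;
quit;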
I don't think there's a clean way to use PROC MEANS to do this directly. What I might do is a two step process. First PROC MEANS to generate the total rows and the two sums, and then join that back to the original table in PROC SQL to get the distinct - when you do the join, it will neatly give you the right ones including the total row.
You might be able to use PROC REPORT to do this as well, but I'm not sure it would be easier than the two step solution.
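The two-step approach described above might look roughly like the sketch below. It is untested and rests on several assumptions: `ind4` is a placeholder name for variable #4, `sums` and `n_distinct4` are made-up names, and none of the class variables contain missing values in `have` (PROC MEANS marks the total rows with missing class values, so real missing values would be confused with them). The OR conditions in the join are likely to be slow on large data.

```sas
/* Step 1: sums for every combination of the class variables, */
/* including the "ALL" rows (class variable missing, see _TYPE_). */
proc means data=have noprint;
  class ind1 ind2 ind3;
  var dependent5 dependent6;
  output out=sums sum=;
run;

/* Step 2: join back to the detail rows to count distinct ind4. */
/* A missing class value on a summary row means "all levels", */
/* so it is allowed to match every detail row. */
proc sql;
  create table want as
  select s.ind1, s.ind2, s.ind3, s._type_,
         s.dependent5, s.dependent6,
         count(distinct h.ind4) as n_distinct4
  from sums as s
       inner join have as h
    on  (missing(s.ind1) or s.ind1 = h.ind1)
    and (missing(s.ind2) or s.ind2 = h.ind2)
    and (missing(s.ind3) or s.ind3 = h.ind3)
  group by s.ind1, s.ind2, s.ind3, s._type_,
           s.dependent5, s.dependent6;
quit;
```

The `_type_` column distinguishes the detail rows from the various total rows, so the "ALL" label can be applied afterwards in a data step if needed.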
Related
I'm very new to SAS. I spent some time looking for a solution for my problem. Unfortunately, I couldn't find any. I'd very much appreciate your help. The problem is actually quite simple.
I have two different datasets (let's call it dat1 and dat2) of different length. Moreover, they both have a variable X, which I'm interested in. I'm looking for a way to find common values of these two columns (let's call them dat1_X and dat2_X). The dataset is quite big though, approximately 10 million observations.
There are many ways!
If one dataset was small, and efficiency was an issue, you could consider using hash tables (or formats) to perform a lookup whilst passing through the larger one.
Otherwise, the following SQL approaches will do the trick (try testing to see which is most efficient):
/* subquery with IN (generally slow) */
proc sql;
create table want1 as
select distinct x
from dat1_x
where x in (select x from dat2_x);
quit;
/* inner join */
proc sql;
create table want2 as
select distinct a.x
from dat1_x a, dat2_x b
where a.x=b.x;
quit;
/* intersect */
proc sql;
create table want3 as
select x
from dat1_x
intersect
select x
from dat2_x;
quit;
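For the hash table lookup mentioned at the start of this answer, a minimal sketch could look like the following. It assumes `dat2_x` is the smaller dataset and fits in memory, and that `x` has the same type and length in both datasets; `want_hash` is a made-up output name.

```sas
data want_hash;
  /* Load the keys of the smaller dataset into memory once. */
  if _n_ = 1 then do;
    declare hash h(dataset: 'dat2_x');
    h.defineKey('x');
    h.defineDone();
  end;
  set dat1_x;
  /* check() returns 0 when the current value of x is found. */
  if h.check() = 0;
run;

/* The step above keeps repeated values of x from dat1_x, */
/* so reduce the result to distinct values afterwards. */
proc sort data=want_hash nodupkey;
  by x;
run;
```

This reads the larger dataset only once and needs no sort of either input, which is where the efficiency gain would come from.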
How can I produce a table that has this kind of info for multiple variables:
VARIABLE COUNT PERCENT
U 51 94.4444
Y 3 5.5556
This is what SAS spits out into the listing output for all variables when I run this program:
ods output nlevels=nlevels1 OneWayFreqs=freq1 ;
proc freq data=sample nlevels ;
tables _character_ / out=outfreq1;
run;
In the outfreq1 table there is the info for just the last variable in the data set (table shown above) but not for all of the variables.
In the nlevels1 table there is info of how many categories each variable has but no frequency data.
What I want though is to output the frequency info for all the variables.
Does anybody know a way to do this without a macro/loop?
You basically have two options, which are sort-of-similar in the kinds of problems you'll have with them: use PROC TABULATE, which more naturally deals with multiple table output, or use the onewayfreqs output that you already call for.
The problem with doing that is that variables may be of different types, so it doesn't have one column with all of that information - it has a pair of columns for each variable, which obviously gets a bit ... messy. Even if your variables are all the same type, SAS can't assume that as a general rule, so it won't produce a nice neat thing for you.
What you can do, though, particularly if you are able to use the formatted values (either due to wanting to, or due to them being identical!), is coalesce them into one result.
For example, given your freq1 dataset from the above:
data freq1_out;
set freq1;
value = coalescec(of f_:); /* the F_ variables hold formatted values and are character, so use COALESCEC */
keep table value frequency percent;
run;
That combines the F_ variables into one variable (as always only one is ever populated). If you can't use the F_ variables and need the original ones, you will have to make your own variable list using a macro variable list (or some other method, or just type the names all out) to use coalesce.
Finally, you could probably use PROC SQL to produce a fairly similar table, although I probably wouldn't do it without using the macro language. UNION ALL is a handy tool here; basically you have separate subqueries for each variable with a group by that variable, so
proc sql;
create table my_freqs as
select 'HEIGHT' as var, height, count(1) as count
from sashelp.class
group by 1,height
union all
select 'WEIGHT' as var, weight, count(1) as count
from sashelp.class
group by 1,weight
union all
select 'AGE' as var, age, count(1) as count
from sashelp.class
group by 1,age
;
quit;
That of course can be trivially macrotized to something like
proc sql;
create table my_freqs as
%freq(table=sashelp.class,var=height)
union all
%freq(table=sashelp.class,var=weight)
union all
%freq(table=sashelp.class,var=age)
;
quit;
or taken even further with either list processing or a macro loop.
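The `%freq` macro is not defined in the answer. One possible definition, offered only as a sketch, is shown below. The output column names `var`, `value`, and `count` and the length of 32 are assumptions. The macro must not emit a semicolon, because the calls sit inside a single SQL statement.

```sas
/* Hypothetical helper: emits one SELECT (no trailing semicolon) */
/* so that calls can be chained with UNION ALL. */
%macro freq(table=, var=);
  select "%upcase(&var)" as var length=32,
         &var as value,
         count(1) as count
  from &table
  group by 1, &var
%mend freq;
```

Because every subquery feeds the same `value` column, this only works when the variables being combined are all numeric or all character, as they are in the sashelp.class example.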
I have a dataset structured with an ID and two other variables.
The id is not unique; it appears in the dataset more than once (a patient could receive more than one clinical treatment). How can I drop the entire observation (the entire line) only if it is a perfect clone of a previous observation (based on the other two variable values)? I don't want to use an insanely long if statement.
Thanks.
proc sql;
select distinct * from olddata;
quit;
Sounds like an easy SQL fix. The select distinct option will remove any completely duplicate rows in a dataset if you select all columns.
If you specifically want to identify if two consecutive lines are identical (but are not looking to match identical lines separated by other lines), you can use notsorted on a by statement and then first and last variables.
data want;
set have;
by id var1 var2 notsorted;
if first.var2;
run;
That will keep the first record for any group of identical id/var1/var2, so long as they're consecutive in the dataset. Of course, if you sort the dataset by id var1 var2 first, this will always remove the duplicates; without sorting, it still removes consecutive runs of two or more identical records.
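I prefer @JJFord's answer, but for the sake of completeness, this can also be done using the nodupe option in proc sort:
proc sort data=mydata nodupe;
by id var1 var2;
run;
The important bit is to specify the nodupe option, but the choice of by variables does matter: nodupe only compares each record with the one before it in the sorted output, so sort by all of the variables (or use by _all_) to make sure identical records end up next to each other.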
I have a rather simplistic question.
Is there any way to merge $n$ data sets in SAS where $n > 2$. I know how to merge 2 data sets.
Thanks
gowers
You can merge multiple data sets using the same syntax as for just two:
data all;
merge ds1 ds2 ds3 ...;
by some_list_of_variables;
run;
If you have many data sets you want to merge, you may want to write a macro that lists them all.
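A macro that lists the data sets might look something like this sketch. The macro name `merge_all` and its parameters are hypothetical, and it assumes the data sets share a prefix followed by a number (ds1, ds2, ...) and are already sorted by the BY variables.

```sas
/* Hypothetical helper: merges &prefix.1 through &prefix.&n */
%macro merge_all(prefix=, n=, by=);
  %local i;
  data all;
    merge
    %do i = 1 %to &n;
      &prefix.&i
    %end;
    ;
    by &by;
  run;
%mend merge_all;

/* Example call: merges ds1, ds2, and ds3 by id */
%merge_all(prefix=ds, n=3, by=id)
```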
In addition to the code @itzy provided, you can identify your data sets using an IN= option on the MERGE statement. This allows you to accept only the matches you need. Also, you must have common variable names to use in your BY statement. You can include a RENAME= option to create a common variable for use in your BY statement.
(Untested code)
data all;
merge ds1(in=one rename=(ds1_id=id))
ds2(in=two rename=(ds2_id=id))
ds3(in=three rename=(ds3_id=id))
;
by some_list_of_variables;
if one and two and three ; /* Creates only matching records from all */
run;
Even though you have said that you want to "merge" datasets, note that the MERGE statement is not the only option. If your merging key has duplicates in more than one dataset, then using the MERGE statement might give logically wrong results even though it would run without complaining. In that case, you can use PROC SQL - I also recall that PROC SQL can be more efficient from SAS 9.1 onwards.
Example -
proc sql;
select <fieldlist>
from data1 t1, data2 t2, data3 t3, data4 t4
where <join condition>;
quit;
I have tried the following options with unacceptable response times - creating index 'key' did not help either (NOTE: duplicate 'keys' in both datasets):
data a;
merge b(in=in_b)
c;
by key;
if in_b; /* keep only records to which b contributed */
run;
=== OR ===
proc sql;
create table a
as select *
from b
left outer join c
on b.key;
quit;
You should first sort the two datasets before merging them; that is what gives the performance gain. Using an index when you have to scan the whole table to get a result is usually slower than presorting the datasets and merging them.
Be sure to trim your dataset as much as possible. Sort your dataset before the data step or proc sql. Also, I'm not 100% sure it matters, but ANSI SQL would be:
proc sql;
create table a as
select *
from b
left outer join c
on b.key=c.key;
quit;
Creating a key is the fastest way to get the datasets ready for joining. Sorting them can take just as long as merging them, if not longer, but is still a good idea.
AFHood's suggestion of trimming them is a good one. Can you possibly run them through a PROC SUMMARY? Are there any columns you can drop? Any of these will reduce the size of your dataset and make the merging faster.
None of these methods may work, however. I routinely merge files of several million rows and it can take a while.
You might try the SQL merge. I don't know if it would be faster for your needs but I've found SQL to be much more efficient than a regular SAS merge. Plus, once you realize what you can do with SQL, manipulating datasets becomes easier!
Don't use a data step merge to do this.
With duplicate keys in both datasets the result will be wrong.
The only way to do this is with PROC SQL:
Proc SQL;
Create table newdata
as select firsttable.*, secondtable.*
from table1 as firsttable
inner join table2 as secondtable
on (firsttable.keyfield = secondtable.keyfield);
quit;
If you have more than one keyfield the join order should be least match field first to greatest match field last. SAS has a bad habit of creating a temporary dataset containing all possible matches and then sieving it down from there. This can blow out your temporary space allocation and slow everything down.
If you still wish to use a DATA Step then you need to get rid of the duplicate keys out of one of the datasets.
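If the DATA step route is taken, removing the duplicate keys from one dataset could be sketched as below. The names `table2_unique` and `table1_sorted` are made up, and note the assumption that it does not matter which of the duplicate records survives: NODUPKEY simply keeps the first record it meets for each key.

```sas
/* Keep one record per key in the second table. */
proc sort data=table2 out=table2_unique nodupkey;
  by keyfield;
run;

/* The first table may keep its duplicates but must be sorted. */
proc sort data=table1 out=table1_sorted;
  by keyfield;
run;

/* A many-to-one merge is now safe; keep matches only. */
data newdata;
  merge table1_sorted(in=in1) table2_unique(in=in2);
  by keyfield;
  if in1 and in2;
run;
```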