Merging data sets in sas - sas

Suppose I have a dataset A:
ID Geogkey
1 A
1 B
1 C
2 W
2 R
2 S
and another dataset B:
ID Temp Date
1 95 1
1 100 2
1 105 3
2 10 1
How do I merge these two datasets so I get three records each for geogkeys with id=1 and one record each for geogkeys where id =2?

Assuming you want the cartesian join, you are best off doing that in SQL, if it's not too big:
proc sql;
create table C as
select * from A,B
where A.ID=B.ID
;
quit;
The select * will generate a warning that the ID variables are overwriting; if that's a concern, explicitly spell out your select (select A.ID, A.Geogkey, B.Temp, B.date).

Related

Summarise and calculate the items specifically in the dataset using proc sql

My dataset and attempt
data mydata;
input Category $ Item $;
datalines;
A 1
A 1
A 2
B 3
B 1
;
proc sql;
create table mytable as
select *, count(Category) as Total_No_in_Category, count(Category)-count(item, "3") as No_of_not_3_in_the_same_category from mydata
group by Category;
run;
Result
Category No_of_not_3_in_the_same_category Total_No_in_Category
A 3 3
A 3 3
A 3 3
B 2 2
B 2 1
My expected result
Category No_of_not_3_in_the_same_ category Total_No_in_Category
A 2 3
B 1 2
I wonder how to achieve the expected result using only proc SQL. Thank you so much.
The two argument COUNT(item, "3") function call is not an summary function. That causes all rows from original table to be automatically remerged with the aggregate computation (those count()). The remerge is a proprietary feature of SAS Proc SQL and not part of the ANSI Standard for SQL.
You appear to want the number of unique non-3 item values, so you will need a
COUNT(DISTINCT ...expression...)
in the query. The ...expression... can be a case clause that transforms item="3" to a null value by not having an else part of the case clause.
Example:
create table want as
select
category
, count(*) as freq
, count(distinct case when item ne "3" then item end) as n_unq_item_not_3
from mydata
group by category
;

show all values in categorical variable

The google search has been difficult for this. I have two categorical variables, age and months, with 7 levels each. for a few levels, say age =7 and month = 7 there is no value and when I use proc sql the intersections that do not have entries do not show, eg:
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
THIS LINE DOESNT SHOW
what i want
age month value
1 1 4
2 1 12
3 1 5
....
7 1 6
...
1 7 8
....
5 7 44
6 7 5
7 7 0
this happens a few times in the data, where tha last groups dont have value so they dont show, but I'd like them to for later purposes
You have a few options available, both seem to work on the premise of creating the master data and then merging it in.
Another is to use a PRELOADFMT and FORMATs or CLASSDATA option.
And the last - but possibly the easiest, if you have all months in the data set and all ages, then use the SPARSE option within PROC FREQ. It creates all possible combinations.
proc freq data=have;
table age*month /out = want SPARSE;
weight value;
run;
First some sample data:
data test;
do age=1 to 7;
do month=1 to 12;
value = ceil(10*ranuni(1));
if ranuni(1) < .9 then
output;
end;
end;
run;
This leaves a few holes, notably, (1,1).
I would use a series of SQL statements to get the levels, cross join those, and then left join the values on, doing a coalesce to put 0 when missing.
proc sql;
create table ages as
select distinct age from test;
create table months as
select distinct month from test;
create table want as
select a.age,
a.month,
coalesce(b.value,0) as value
from (
select age, month from ages, months
) as a
left join
test as b
on a.age = b.age
and a.month = b.month;
quit;
The group independent crossing of the classification variables requires a distinct selection of each level variable be crossed joined with the others -- this forms a hull that can be left joined to the original data. For the case of age*month having more than one item you need to determine if you want
rows with repeated age and month and original value
rows with distinct age and month with either
aggregate function to summarize the values, or
an indication of too many values
data have;
input age month value;
datalines;
1 1 4
2 1 12
3 1 5
7 1 6
1 7 8
5 7 44
6 7 5
8 8 1
8 8 11
run;
proc sql;
create table want1(label="Original class combos including duplicates and zeros for absent cross joins")
as
select
allAges.age
, allMonths.month
, coalesce(have.value,0) as value
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
order by
allMonths.month, allAges.age
;
quit;
And a slight variation that marks duplicated class crossings
proc format;
value S_V_V .t = 'Too many source values'; /* single valued value */
quit;
proc sql;
create table want2(label="Distinct class combos allowing only one contributor to value, or defaulting to zero when none")
as
select distinct
allAges.age
, allMonths.month
, case
when count(*) = 1 then coalesce(have.value,0)
else .t
end as value format=S_V_V.
, count(*) as dup_check
from
(select distinct age from have) as allAges
cross join
(select distinct month from have) as allMonths
left join
have
on
have.age = allAges.age and have.month = allMonths.month
group by
allMonths.month, allAges.age
order by
allMonths.month, allAges.age
;
quit;
This type of processing can also be done in Proc TABULATE using the CLASSDATA= option.

SAS_Count Frequency

I want to ask a complicated (for me) question about SAS programming. I think I can explain better by using simple example. So, I have the following dataset:
Group Category
A 1
A 1
A 2
A 1
A 2
A 3
B 1
B 2
B 2
B 1
B 3
B 2
I want to count the each category for each group. I can do it by using PROC FREQ. But it is not better way for my dataset. It will be time consuming for me as my dataset is too large and I have a huge number of groups. So, if I use PROC FREQ, firstly I need to create new datasets for each group and then use PROC FREQ for each group. In sum, I need to create the following dataset:
CATEGORIES
Group 1 (first category) 2 3
A 3 2 1
B 2 3 1
So, the number of first category in group A is 3. The number of first category in group B is 2 and so on. I think I can explain it. Thanks for your helps.
There is more than one way to do this in SAS. My bias is proc sql, so:
proc sql;
select grp,
sum(case when category = 1 then 1 else 0 end) as cat_1,
sum(case when category = 2 then 1 else 0 end) as cat_2,
sum(case when category = 3 then 1 else 0 end) as cat_3
from t
group by grp;
Either proc freq or proc summary will do the job of producing frequency counts:
data example;
length group category $1;
input group category;
cards;
A 1
A 1
A 2
A 1
A 2
A 3
B 1
B 2
B 2
B 1
B 3
B 2
;
run;
proc freq data=example;
table group*category;
run;
proc summary data=example nway;
class group category;
output out=example_frequency (drop=_type_);
run;
proc summary will produce a dataset in a 'long' format. If you need to transpose it (I'd suggest not doing so: you'll probably find working with the long format easier in most circumstances) you can use proc transpose:
proc transpose data=example_frequency out=example_matrix (drop=_name_);
by group;
id category;
var _freq_;
run;

SAS_Working Two Dataset

I want to merge two data sets in SAS. I want to show by using example:
Group Value
A 10
A 8
A 6
B 7
B 9
B 11
it is my first data set. I have the second dataset as well:
Group Volume
A 2
B 3
I want to merge these two data sets. The result should be:
Group Value Volume
A 10 2
A 8 2
A 6 2
B 7 3
B 9 3
B 11 3
I hope, i can explain it. Many thanks.
Well one way is to use proc sql and just use a join:
proc sql noprint;
select a.*,b.volume
from dataset1 as a
left join dataset2 as b
on a.group = b.group;quit;
or if you want to do it with merge:
data combine;
merge dataset1 dataset2;
by group;
run;

Setting *most* variables to missing, while preserving the contents of a select few

I have a dataset like this (but with several hundred vars):
id q1 g7 q3 b2 zz gl az tre
1 1 2 1 1 1 2 1 1
2 2 3 3 2 2 2 1 1
3 1 2 3 3 2 1 3 3
4 3 1 2 2 3 2 1 1
5 2 1 2 2 1 2 3 3
6 3 1 1 2 2 1 3 3
I'd like to keep id, b2, and tre, but set everything else to missing. In a dataset this small, I can easily use call missing (q1, g7, q3, zz, gl, az) - but in a set with many more variables, I would effectively like to say call missing (of _ALL_ *except ID, b2, tre*).
Obviously, SAS can't read my mind. I've considered workarounds that involve another data step or proc sql where I copy the original variables to a new ds and merge them back on post, but I'm trying to find a more elegant solution.
This technique uses an un-executed set statement (compile time function only) to define all variables in the original data set. Keeps the order and all variable attributes type, labels, format etc. Basically setting all the variables to missing. The next SET statement which will execute brings in only the variables the are NOT to be set to missing. It doesn't explicitly set variables to missing but achieves the same result.
data nomiss;
input id q1 g7 q3 b2 zz gl az tre;
cards;
1 1 2 1 1 1 2 1 1
2 2 3 3 2 2 2 1 1
3 1 2 3 3 2 1 3 3
4 3 1 2 2 3 2 1 1
5 2 1 2 2 1 2 3 3
6 3 1 1 2 2 1 3 3
;;;;
run;
proc print;
run;
data manymiss;
if 0 then set nomiss;
set nomiss(keep=id b2 tre:);
run;
proc print;
run;
Another fairly simple option is to set them missing using a macro, and basic code writing techniques.
For example, let's say we have a macro:
%call_missing(var=);
call missing(&var.);
%mend call_missing;
Now we can write a query that uses dictionary.columns to identify the variables we want set to missing:
proc sql;
select name
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not (name in ('ID','B2','TRE')); *note UPCASE for all these;
quit;
Now, we can combine these two things to get a macro variable containing code we want, and use that:
proc sql;
select cats('%call_missing(var=',name ,')')
into :misslist separated by ' '
from dictionary.columns
where libname='WORK' and memname='HAVE'
and not (name in ('ID','B2','TRE')); *note UPCASE for all these;
quit;
data want;
set have;
&misslist.;
run;
This has the advantage that it doesn't care about the variable types, nor the order. It has the disadvantage that it's somewhat more code, but it shouldn't be particularly long.
If the variables are all of the same type (numeric or character) then you could use an array.
data want ;
set have;
array _all_ _numeric_ ;
do over _all_;
if upcase(vname(_all_)) not in ('ID','B2') then _all_=.;
end;
run;
If you don't care about the order then just drop the variables and add them back on with 0 observations.
data want;
set have (keep=ID B2 TRE:) have (obs=0 drop=ID B2 TRE:);
run;