I am hoping you can help me! Please help!!!!
I am in SAS using PROC SQL and I have datasets A and B with different measurements (relating to patient's health) as follows:
Dataset A
ID Date measurement_a
1 20JUN2013 52.3
1 12JUL2013 65.6
1 28NOV2014 37.4
1 02DEC2014 61.3
1 22SEP2015 40.5
1 15OCT2015 60.5
2 03JUN2011 46.5
2 19JUL2011 54.1
2 29OCT2012 53.6
...
Dataset B
ID Date measurement_b
1 21MAR2007 43
1 13JUL2007 45
1 07APR2009 47
1 14MAY2009 46
1 16FEB2012 42
1 27AUG2012 53
1 12DEC2012 58
1 20JUN2013 56
1 10DEC2013 53
1 23MAY2014 49
1 17SEP2014 44
1 23SEP2015 40
2 16DEC2011 58
2 22AUG2012 54
2 20FEB2013 56
2 29MAY2013 53
...
What I am looking for is that if the date in Dataset B is within 6 months of the date in Dataset a, then a new variable called "time" will be added, saying 1,2,3,etc. for how many ever match with ** only measurement_a** length (in other words, I do not need to retain values of measurement_b if it does not match the date in Dataset a. Here is an example of what I mean:
Desired result/dataset:
ID Time measurement_a measurement_b
1 1 52.3 56 (Dataset B Date = 20JUN2013 - Matched exactly)
1 2 65.6 53 (Dataset B date = 10DEC2013 - Within six months of 12JUL2013 [Dataset A Date])
1 3 37.4 44 (Dataset B date = 17SEP2014 - Within six months of 28NOV2014 [Dataset A Date])
1 4 61.3 . (because 17SEP2014 [Dataset B] is closest to 28NOV2014 [Dataset A])
1 5 40.5 40 (because 23SEP2015 [Dataset B] is closest to 22SEP2015 [Dataset A])
1 6 60.5 . (No date in Dataset B that is within 6 months of Date in Dataset A [15OCT2015])
2 1 46.5 . (See below)
2 2 54.1 58 (because 03JUL2011 [Dataset B] is closest to 19JUL2011 [Dataset A])
2 3 53.6 54 (Dataset B date = 22AUG2012 - Within 6 months of Dataset A date = 29OCT2012)
...
I have joined on ID but the times is proving difficult. I know it could be the difference in months in the "where" statement in the following code:
PROC SQL;
CREATE TABLE join_test as
SELECT * FROM data_a as a
LEFT_JOIN data_b as b
ON a.id = b.id
WHERE days(a.Date - b.Date) <= 180 ;
QUIT;
But this does not do the trick.
Can some please help me?
I really appreciate it. Thanks in advance.
In the join criteria add the use of the SAS function INTCK to compute the number of month intervals between the two date values. Proc SQL does not have a way to introduce a serial count value, so you will have to add that in a subsequent step. A LEFT JOIN will create a result set with every id/date in table A.
Example:
The columns a.date, b_date and c_months_apart were added to show how the join works. You can safely remove them from the select.
proc sql;
create table stage1 as
select
a.id
, a.date
, a.measurement_a
, b.measurement_b
, b.date as b_date
, intck('month', a.date, b.date, 'C') as c_months_apart
from
a left join b
on a.id = b.id
and intck('month', a.date, b.date, 'C') between 0 and 6
order by a.id, a.date, b.date
;
data want;
set stage1;
by id;
if first.id then time=1; else time+1;
run;
Output (want)
measurement_ measurement_ c_months_
ID Date a b b_date apart time
1 20JUN2013 52.3 56 20JUN2013 0 1
1 20JUN2013 52.3 53 10DEC2013 5 2
1 12JUL2013 65.6 53 10DEC2013 4 3
1 28NOV2014 37.4 . . . 4
1 02DEC2014 61.3 . . . 5
1 22SEP2015 40.5 40 23SEP2015 0 6
1 15OCT2015 60.5 . . . 7
2 03JUN2011 46.5 58 16DEC2011 6 1
2 19JUL2011 54.1 58 16DEC2011 4 2
2 29OCT2012 53.6 56 20FEB2013 3 3
Related
How to Capture previous row value and perform subtraction
Refer Table 1 as main data, Table 2 as desired output, Let me explain you in detail, Closing_Bal is derived from (Opening_bal - EMI) for eg if (20 - 2) = 18, as value 18 i want in 2nd row under opening_bal column then ( opening_bal - EMI) and so till new LAN , If New LAN available then start the loop again ,
i have created lag function butnot able to run loop
Try this
data A;
input Month $ LAN Opening_Bal EMI Closing_Bal;
infile datalines dlm = '|' dsd;
datalines;
1_Nov|1|20|2|18
2_Dec|1| |3|
3_Jan|1| |5|
4_Feb|1| |3|
1_Nov|2|30|4|26
2_Dec|2| |3|
3_Jan|2| |2|
4_Feb|2| |5|
5_Mar|2| |6|
;
data B(drop = c);
set A;
by LAN;
if first.LAN then c = Closing_Bal;
if Opening_Bal = . then do;
Opening_Bal = c;
Closing_Bal = Opening_Bal - EMI;
c = Closing_Bal;
end;
retain c;
run;
Result:
Month LAN Opening_Bal EMI Closing_Bal
1_Nov 1 20 2 18
2_Dec 1 18 3 15
3_Jan 1 15 5 10
4_Feb 1 10 3 7
1_Nov 2 30 4 26
2_Dec 2 26 3 23
3_Jan 2 23 2 21
4_Feb 2 21 5 16
5_Mar 2 16 6 10
The problem is that you already have CLOSING_BAL on the input dataset, so when the SET statement reads a new observation it will overwrite the value calculated on the previous observation. Either drop or rename the variable in the source dataset.
Example:
data have;
input Month $ LAN Opening_Bal EMI Closing_Bal;
datalines;
1_Nov 1 20 2 18
2_Dec 1 . 3 .
3_Jan 1 . 5 .
4_Feb 1 . 3 .
1_Nov 2 30 4 26
2_Dec 2 . 3 .
3_Jan 2 . 2 .
4_Feb 2 . 5 .
5_Mar 2 . 6 .
;
data want;
set have (drop=closing_bal);
retain Closing_Bal;
Opening_Bal=coalesce(Opening_Bal,Closing_Bal);
Closing_bal=Opening_bal - EMI ;
run;
Results:
Opening_ Closing_
Obs Month LAN Bal EMI Bal
1 1_Nov 1 20 2 18
2 2_Dec 1 18 3 15
3 3_Jan 1 15 5 10
4 4_Feb 1 10 3 7
5 1_Nov 2 30 4 26
6 2_Dec 2 26 3 23
7 3_Jan 2 23 2 21
8 4_Feb 2 21 5 16
9 5_Mar 2 16 6 10
I am not sure this works
data B;
set A;
by lan;
if not first.lan then do;
opening_bal = lag(closing_bal);
closing_bal = opening_bal - EMI;
end;
run;
because you don't execute lag for each observation.
My goal is to combine these tables into one without having to manually run my macro each time for each column.
The code I currently have with me is the following:
%macro task_Oct(set,col_name);
data _type_;
set &set;
call symputx('col_type', vtype(&col_name));
run;
proc sql;
create table work.oct27 as
select "&col_name" as variable,
"&col_type" as type,
nmiss(&col_name) as missing_val,
count(distinct &col_name) as distinct_val
from &set;
quit;
%mend task_Oct;
%task_Oct(sashelp.cars,Origin)
The above code gives me the following output:
|Var |Type |missing_val|distinct_val|
|Origin|Character|0 | 3 |
But the sashelp.cars data sheet has 15 columns and so I would like to output a new data sheet which has 15 rows with 4 columns.
I would like to get the following combined table as the output of my code:
|Var |Type |missing_val|distinct_val|
|Make |Character|0 | 38 |
|Model |Character|0 | 425 |
|Type |Character|0 | 6 |
|Origin|Character|0 | 3 |
...
...
...
Since I'm using a macro, I could run my code 15 different times by manually entering the names of the columns and then merging the tables into 1; and it wouldn't be a problem. But what if I have a table with 100s of columns? I could use some loop statement but I'm not sure how to go about that in this case. Help would be appreciated. Thank you.
The main output you appear to want can be generated directly by PROC FREQ with the NLEVELS option. If you want to add in the variable type then just merge it with the output of PROC CONTENTS.
ods exclude all;
ods output nlevels=nlevels;
proc freq data=sashelp.cars nlevels;
tables _all_ / noprint ;
run;
ods select all;
proc contents data=sashelp.cars noprint out=contents;
run;
proc sql;
create table want(drop=table:) as
select c.varnum,c.name,c.type,n.*
from contents c inner join nlevels n
on c.name=n.TableVar
order by varnum
;
quit;
Result
NNon
NMiss Miss
Obs VARNUM NAME TYPE NLevels Levels Levels
1 1 Make 2 38 0 38
2 2 Model 2 425 0 425
3 3 Type 2 6 0 6
4 4 Origin 2 3 0 3
5 5 DriveTrain 2 3 0 3
6 6 MSRP 1 410 0 410
7 7 Invoice 1 425 0 425
8 8 EngineSize 1 43 0 43
9 9 Cylinders 1 8 1 7
10 10 Horsepower 1 110 0 110
11 11 MPG_City 1 28 0 28
12 12 MPG_Highway 1 33 0 33
13 13 Weight 1 348 0 348
14 14 Wheelbase 1 40 0 40
15 15 Length 1 67 0 67
The NMissLevels variable counts the number of distinct types of missing values.
If instead you want to count the number of observations with (any) missing value you will need to use code generation. So use the CONTENTS data to generate an SQL query to generate all of the counts you want into a single observation with only one pass through the data. You can then transpose that to make it usable for re-merging with the CONTENTS data.
filename code temp;
data _null_;
set contents end=eof;
length nliteral $65 dsname $80;
nliteral=nliteral(name);
dsname = catx('.',libname,nliteral(memname));
file code;
if _n_=1 then put 'create table counts as select ' / ' ' # ;
else put ',' #;
put 'nmiss(' nliteral ') as missing_' varnum
/',count(distinct ' nliteral ') as distinct_' varnum
;
if eof then put 'from ' dsname ';';
run;
proc sql;
%include code /source2;
quit;
proc transpose data=counts out=count2 name=name ;
run;
proc sql ;
create table want as
select c.varnum, c.name, c.type
, m.col1 as nmissing
, d.col1 as ndistinct
from contents c
left join count2 m on m.name like 'missing%' and c.varnum=input(scan(m.name,-1,'_'),32.)
left join count2 d on d.name like 'distinct%' and c.varnum=input(scan(d.name,-1,'_'),32.)
order by varnum
;
quit;
Result
Obs VARNUM NAME TYPE nmissing ndistinct
1 1 Make 2 0 38
2 2 Model 2 0 425
3 3 Type 2 0 6
4 4 Origin 2 0 3
5 5 DriveTrain 2 0 3
6 6 MSRP 1 0 410
7 7 Invoice 1 0 425
8 8 EngineSize 1 0 43
9 9 Cylinders 1 2 7
10 10 Horsepower 1 0 110
11 11 MPG_City 1 0 28
12 12 MPG_Highway 1 0 33
13 13 Weight 1 0 348
14 14 Wheelbase 1 0 40
15 15 Length 1 0 67
Say I have a dataset like this
day product sales
1 a 1 48
2 a 2 55
3 a 3 88
4 b 2 33
5 b 3 87
6 c 1 97
7 c 2 95
On day "b" there were no sales for product 1, so there is no row where day = b and product = 1. Is there an easy way to add a row with day = b, product = 1 and sales = 0, and similar "missing" rows to get a dataset like this?
day product sales
1 a 1 48
2 a 2 55
3 a 3 88
4 b 1 0
5 b 2 33
6 b 3 87
7 c 1 97
8 c 2 95
9 c 3 0
In R you can do complete(df, day, product, fill = list(sales = 0)). I realize you can accomplish this with a self-join in proc sql, but I'm wondering if there is a procedure for this.
In this particular example you can also use the SPARSE option in PROC FREQ. It tells SAS to generate all the complete types with every value from DAY included with PRODUCT, so similar to a cross join between those elements. If you do not have the value in the table already it cannot add the value. You would need a different method in that case.
data have;
input n day $ product sales;
datalines;
1 a 1 48
2 a 2 55
3 a 3 88
4 b 2 33
5 b 3 87
6 c 1 97
7 c 2 95
;;;;
run;
proc freq data=have noprint;
table day*product / out=want sparse;
weight sales;
run;
proc print data=want;run;
There are, as usual in SAS, about a dozen ways to do this. Here's my favorite.
data have;
input n day $ product sales;
datalines;
1 a 1 48
2 a 2 55
3 a 3 88
4 b 2 33
5 b 3 87
6 c 1 97
7 c 2 95
;;;;
run;
proc means data=have completetypes;
class day product;
types day*product;
var sales;
output out=want sum=;
run;
completetypes tells SAS to put out rows for every class combination, including missing ones. You could then use proc stdize to get them to be 0's (if you need them to be 0). It's possible you might be able to do this in the first place with proc stdize, I'm not as familiar unfortunately with that proc.
You can do this with proc freq using the sparse option.
Code:
proc freq data=have noprint;
table day*product /sparse out=freq (drop=percent);
run;
Output:
day=a product=1 COUNT=1
day=a product=2 COUNT=1
day=a product=3 COUNT=1
day=b product=1 COUNT=0
day=b product=2 COUNT=1
day=b product=3 COUNT=1
day=c product=1 COUNT=1
day=c product=2 COUNT=1
day=c product=3 COUNT=0
I have a SAS Table like:
DATA test;
INPUT id sex $ age inc r1 r2 Zaehler work $;
DATALINES;
1 F 35 17 7 2 1 w
17 M 40 14 5 5 1 w
33 F 35 6 7 2 1 w
49 M 24 14 7 5 1 w
65 F 52 9 4 7 1 w
81 M 44 11 7 7 1 w
2 F 35 17 6 5 1 n
18 M 40 14 7 5 1 n
34 F 47 6 6 5 1 n
50 M 35 17 5 7 1 w
;
PROC PRINT; RUN;
proc sort data=have;
by county;
run;
I want compare rows if sex and age is equal and build sum over Zaehler. For example:
1 F 35 17 7 2 1 w
and
33 F 35 6 7 2 1 w
sex=f and age=35 are equale so i want to merge them like:
id sex age inc r1 r2 Zaehler work
1 F 35 17 7 2 2 w
I thought i can do it with proc sql but i can't use sum in proc sql. Can someone help me out?
PROC SUMMARY is the normal way to compute statistics.
proc summary data=test nway ;
class sex age ;
var Zaehler;
output out=want sum= ;
run;
Why would you want to include variables other than SEX, AGE and Zaehler in the output?
Your requirement is not difficult to understand or to satisfy, however, I am not sure what is your underline reason for doing this. Explain more on your purpose may help to facilitate better answers that work from the root of your project. Although I have a feeling the PROC MEAN may give you better matrix, here is a one step PROC SQL solution to get you the summary as well as retaining "the value of first row":
proc sql;
create table want as
select id, sex , age, inc, r1, r2, sum(Zaehler) as Zaehler, work
from test
group by sex, age
having id = min(id) /*This is tell SAS only to keep the row with the smallest id within the same sex,age group*/
;
quit;
You can use proc sql to sum over sex and age
proc sql;
create table sum as
select
sex
,age
,sum(Zaehler) as Zaehler_sum
from test
group by
sex
,age;
quit;
You can than join it back to the main table if you want to include all the variables
proc sql;
create table test_With_Sum as
select
t.*
,s.Zaehler_sum
from test t
inner join sum s on t.sex = s.sex
and t.age = s.age
order by
t.sex
,t.age
;
quit;
You can write it all as one proc sql query if you wish and the order by is not needed, only added for a better visibility of summarised results
Not a good solution. But it should give you some ideas.
DATA test;
INPUT id sex $ age inc r1 r2 Zaehler work $;
DATALINES;
1 F 35 17 7 2 1 w
17 M 40 14 5 5 1 w
33 F 35 6 7 2 1 w
49 M 24 14 7 5 1 w
65 F 52 9 4 7 1 w
81 M 44 11 7 7 1 w
2 F 35 17 6 5 1 n
18 M 40 14 7 5 1 n
34 F 47 6 6 5 1 n
50 M 35 17 5 7 1 w
;
run;
data t2;
set test;
nobs = _n_;
run;
proc sort data=t2;by descending sex descending age descending nobs;run;
data t3;
set t2;
by descending sex descending age;
if first.age then count = 0;
count + 1;
zaehler = count;
if last.age then output;
run;
proc sort data=t3 out=want(drop=nobs count);by nobs sex age;run;
thanks for your help. Here is my final code.
proc sql;
create table sum as
select distinct
sex
,age
,sum(Zaehler) as Zaehler
from test
WHERE work = 'w'
group by
sex
,age
;
PROC PRINT;quit;
I just modify the code a little bit. I filtered the w and i merg the Columns with the same value.
It was just an example the real Data is much bigger and has more Columns and rows.
Hi I am trying to subset a dataset which has following
ID sal count
1 10 1
1 10 2
1 10 3
1 10 4
2 20 1
2 20 2
2 20 3
3 30 1
3 30 2
3 30 3
3 30 4
I want to take out only those IDs who are recorded 4 times.
I wrote like
data AN; set BU
if last.count gt 4 and last.count lt 4 then delete;
run;
But there is something wrong.
EDIT - Thanks for clarifying. Based on your needs, PROC SQL will be more direct:
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING MAX(COUNT) = 4
;quit;
For posterity, here is how you could do it with only a data step:
In order to use first. and last., you need to use a by clause, which requires sorting:
proc sort data=BU;
by ID DESCENDING count;
run;
When using a SET statement BY ID, first.ID will be equal to 1 (TRUE) on the first instance of a given ID, 0 (FALSE) for all other records.
data AN;
set BU;
by ID;
retain keepMe;
If first.ID THEN DO;
IF count = 4 THEN keepMe=1;
ELSE keepMe=0;
END;
if keepMe=0 THEN DELETE;
run;
During the datastep BY ID, your data will look like:
ID sal count keepMe first.ID
1 10 4 1 1
1 10 3 1 0
1 10 2 1 0
1 10 1 1 0
2 20 3 0 1
2 20 2 0 0
2 20 1 0 0
3 30 4 1 1
3 30 3 1 0
3 30 2 1 0
3 30 1 1 0
If I understand correct, you are trying to extract all observations are are repeated 4 time or more. if so, your use of last.count and first.count is wrong. last.var is a boolean and it will indicate which observation is last in the group. Have a look at Tim's suggestion.
In order to extract all observations that are repeated four times or more, I would suggest to use the following PROC SQL:
PROC SQL;
CREATE TABLE WORK.WANT AS
SELECT /* COUNT_of_ID */
(COUNT(t1.ID)) AS COUNT_of_ID,
t1.ID,
t1.SAL,
t1.count
FROM WORK.HAVE t1
GROUP BY t1.ID
HAVING (CALCULATED COUNT_of_ID) ge 4
ORDER BY t1.ID,
t1.SAL,
t1.count;
QUIT;
Result:
1 10 1
1 10 2
1 10 3
1 10 4
3 30 1
3 30 2
3 30 3
3 30 4
Slight variation on Tims - assuming you don't necessarily have the count variable.
proc sql;
CREATE TABLE AN as
SELECT * FROM BU
GROUP BY ID
HAVING Count(ID) >= 4;
quit;