I'm pretty new in SAS, so I'm struggling to find out how to rearrange my data. My data set looks like this:
CPT DATE A B C D etc.
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
etc.
where cpt(i) represents each counterparty, date(i) represents the date of my cash flows and A,B,C,D are the different types of cash flows. Since this dataset has lots of columns, I'd like to rearrange the data by increasing the number of rows when there is more than one cash flow in date(i). So the output is supposed to be this one:
CPT DATE Cash Flow Type
1 date1 20.000 A
1 date1 5.000 B
1 date2 30.000 D
1 date3 10.000 B
1 date3 10.000 C
2 date1 3.000 A
2 date2 3.000 B
2 date3 5.000 C
2 date4 3.000 D
etc.
Any tips on how to get what I want? Cheers
Datalines format of data is below.
data have;
input CPT DATE$ A B C D;
format a b c d 8.3;
datalines;
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
;
run;
This is a 'wide to long' transpose. It's really easy!
data have;
input CPT DATE $ A B C D ;
datalines;
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
;;;;
run;
proc transpose data=have out=want;
by cpt date;
var a b c d;
run;
If there are more complexities than this, you can also do this in the data step.
Use proc transpose. It's the easiest way to transpose any data in SAS. It'll automatically rename variable column names to COL1, COL2, etc. Use the rename= output dataset option to rename your variable to cash_flow.
proc transpose data = have
out = want(rename=(COL1 = cash_flow) )
name = type
;
by cpt date;
run;
A more tricked out TRANSPOSE can set the pivot column label and restrict the output to non-zero cashflow.
proc transpose data=have
out=want(
rename=(_name_=Type col1=cashflow)
where=(cashflow ne 0)
)
;
by cpt date;
var a b c d;
label cashflow='Cash Flow';
run;
You will have to endure a log message
WARNING: Variable CASHFLOW not found in data set WORK.HAVE.
Related
I have a dataset of patient data with each diagnosis on a different line.
This is an example of what it looks like:
patientID diabetes cancer age gender
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
I need to isolate the patients who have a diagnosis of both diabetes and cancer; their unique patient identifier is patientID. Sometimes they are both on the same line, sometimes they aren't. I am not sure how to do this because the information is on multiple lines.
How would I go about doing this?
This is what I have so far:
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(DOB) as DOB
from diab_dx
group by patientID;
quit;
data final; set want;
if diabetes GE 1 AND cancer GE 1 THEN both = 1;
else both =0;
run;
proc freq data=final;
tables both;
run;
Is this correct?
If you want to learn about data steps lookup how this works.
data pat;
input patientID diabetes cancer age gender:$1.;
cards;
1 1 0 65 M
1 0 1 65 M
2 1 1 23 M
2 0 0 23 M
3 0 0 50 F
3 0 0 50 F
;;;;
run;
data both;
do until(last.patientid);
set pat; by patientid;
_diabetes = max(diabetes,_diabetes);
_cancer = max(cancer,_cancer);
end;
both = _diabetes and _cancer;
run;
proc print;
run;
add a having statement at the end of sql query should do.
PROC SQL;
create table want as
select patientID
, max(diabetes) as diabetes
, max(cancer) as cancer
, min(age) as DOB
from PAT
group by patientID
having calculated diabetes ge 1 and calculated cancer ge 1;
quit;
You might find some coders, especially those coming from statistical backgrounds, are more likely to use Proc MEANS instead of SQL or DATA step to compute the diagnostic flag maximums.
proc means noprint data=have;
by patientID;
output out=want
max(diabetes) = diabetes
max(cancer) = cancer
min(age) = age
;
run;
or for the case of all the same aggregation function
proc means noprint data=have;
by patientID;
var diabetes cancer;
output out=want max= ;
run;
or
proc means noprint data=have;
by patientID;
var diabetes cancer age;
output out=want max= / autoname;
run;
I am trying to match max daily data within a month to a monthly data.
data daily;
input permno $ date ret;
datalines;
1000 19860101 88
1000 19860102 90
1000 19860201 70
1000 19860202 55
1001 19860201 97
1001 19860202 74
1001 19860203 79
1002 19860301 55
1002 19860302 100
1002 19860301 10
;
run;
data monthly;
input permno $ date ret;
datalines;
1000 19860131 1
1000 19860228 2
1000 19860331 5
1001 19860331 3
1002 19860430 4
;
run;
The result I want is the following; (I want to match daily max data to one month lag monthly data. )
1000 19860102 90 1000 19860228 2
1000 19860201 70 1000 19860331 5
1001 19860201 97 1001 19860331 3
1002 19860302 100 1002 19860430 4
Below is what I have tried so far.
I want to have maximum ret value within a month so I have created yrmon to assign same yyyymm data for the same month daily data
data a1; set daily;
yrmon=year(date)*100 + month(date);
run;
In order to choose the maximum value(here, ret) within same yrmon group for the same permno, I used code below
proc means data=a1 noprint;
class permno yrmon ;
var ret;
output out= a2 max=maxret;
run;
However, it only got me permno yrmon ret data, leaving the original date data away.
data a3;
set a2;
new=intnx('month',yrmon,1);
format date new yymmn6.;
run;
But it won't work since yrmon is no longer date format.
Thank you in advance.
Hello
I am trying to match two different sets by permno(same company) but with one month lag (eg. daily9 dataset yrmon=198601 and monthly2 dataset yrmon=198602)
it is pretty difficult to handle for me because if I just add +1 in yrmon, 198612 +1 will not be 198701 and I am confused with handling these issues.
Can anyone help?
1) informat date1/date2 yymmn6. is used to read the date in yyyymm format
2) format date1/date2 yymmn6. is used to view the date in yyyymm format
3) intnx("months",b.date2,-1) is used to join the dates with lag of 1 month
data data1;
input date1 value1;
informat date1 yymmn6.;
format date1 yymmn6.;
cards;
200101 200
200212 300
200211 400
;
run;
data data2;
input date2 value2;
informat date2 yymmn6.;
format date2 yymmn6.;
cards;
200101 3000000
200102 4000000
200301 2000000
200212 2000000
;
run;
proc sql;
create table result as
select a.*,b.date2,b.value2 from
data1 a
left join
data2 b
on a.date1 = intnx("months",b.date2,-1);
quit;
My Output:
date1 |value1 |date2 |value2
200101 |200 |200102 |4000000
200211 |400 |200212 |2000000
200212 |300 |200301 |2000000
Let me know in case of any queries.
I have created this table:
And from this I want to create an adjacency matrix which shows how many employee_id's the tables share. It would look like this (I think):
I'm not sure if I'm going about this the correct way. I think I may be doing it wrong. I know that this is probably easier if I have more SAS products but I only have the basic SAS enterprise guide to work with.
I really appreciate the help. Thank you.
Here's another way using PROC CORR that's still better than the solution above. And you don't need to filter - it doesn't matter regarding the variables, you only specify them in the PROC CORR procedure.
data id;
input id:$4. human alien wizard;
cards;
1005 1 1 0
1018 0 0 1
1022 0 0 1
1024 1 0 0
1034 0 1 0
1069 0 1 0
1078 1 0 0
1247 1 1 1
;;;;
run;
ods output sscp=want;
proc corr data=id sscp ;
var human alien wizard;
run;
proc print data=want;
format _numeric_ 8.;
run;
Results are:
Obs Variable human alien wizard
1 human 4 2 1
2 alien 2 4 1
3 wizard 1 1 3
I think this is what you want but it does not give the thing you show as answer.
data id;
input id:$4. human alien wizard;
cards;
1005 1 1 0
1018 0 0 1
1022 0 0 1
1024 1 0 0
1034 0 1 0
1069 0 1 0
1078 1 0 0
1247 1 1 1
;;;;
run;
proc corr noprint nocorr sscp out=sscp;
var human alien wizard;
run;
proc print;
run;
I was able to get the answer using this, although it does not include the last cell I wanted (human_alien_wizard):
proc transpose data=FULL_JOIN_ALL3 out=FULL_JOIN_ALL3_v2;
by employee_id;
var human_table alien_table wizard_table;
run;
proc sql;
create table FULL_JOIN_ALL3_v3 as
select distinct a._name_ as anm,b._name_ as bnm,
count(distinct case when a.col1=1 and b.col1=1 then a.employee_id else . end) as smalln
from FULL_JOIN_ALL3_v2 a, FULL_JOIN_ALL3_v2 b
where a.employee_id=b.employee_id
group by anm,bnm
;
proc tabulate data=FULL_JOIN_ALL3_v3;
class anm bnm;
var smalln;
table anm='',bnm=''*smalln=''*sum=''*f=best3. / rts=5;
run;
I am trying to create a dummy variable A to indicate the number of ID appearances. If ID appears exactly twice, then A equals 0; if ID appears more than 2 times, then A equals 1.
My desired output is as follows:
ID YEAR A
1078 1989 0
1078 1999 0
1161 1969 0
1161 2002 0
1230 1995 0
1230 2002 0
1279 1996 0
1279 2003 0
1447 1993 0
1447 2001 0
1487 1967 1
1487 2008 1
1487 2009 1
1678 1979 0
1678 2002 0
My code is:
data new data;
set data;
by ID YEAR;
if First.ID then count=0;
count+1;
run;
data newdata;
set newdata;
by ID YEAR count;
if sum(count)=3 then A=0;
if sum(count)>3 then A=1;
run;
But the output is incorrect.
You can't use sum(count) to sum the count across multiple observations. Instead you need to first count the observations, and then merge the counts into the original data.
Data step solution
data newdata;
set data;
by ID YEAR;
if First.ID then count=0;
count+1;
if last.id then output;
drop year;
run;
if last.id then output means that newdata will only have the last observation for each id. We want that because the last observation has the county of the number of observations per id.
This step merges the original data "data" with the counts from "newdata" and computes a.
data newdata2;
merge data newdata;
by ID;
if count=2 then A=0;
if count>2 then A=1;
drop count;
run;
proc sql solution
You can do it even more quickly with proc sql:
proc sql;
create table newdata as
select id, year, (case when count(*) = 2 then 0 else 1 end) as a
from data
group by id;
quit;
group by id means that summary statistics such as count() are calculated by id, so the count(*) means count observations in each id. (case when count(*) = 2 then 0 else 1 end) as a creates a variable a. If the count for a given id is 2 then a takes the value of 0, else a takes the value of 1.
Tried various formats of date, but output do not reflects any date. What could be the issue?
data c;
input age gender income color$ doj$;
format doj date9.;
datalines;
19 1 14000 W 14/07/1988
45 2 45000 b 15/09/1956
34 2 56000 y 14/09/1967
33 1 45000 b 14/02/1956
;
run;
You are mixing things up a bit.
The date formats are to be applied on numeric data, not on text data.
So you should not read in doj as $ (text), but as a date (so a date informat).
Try DDMMYY10. for doj on your input statement:
data c;
input age gender income color$ doj ddmmyy10.;
format doj date9.;
datalines;
19 1 14000 W 14/07/1988
45 2 45000 b 15/09/1956
34 2 56000 y 14/09/1967
33 1 45000 b 14/02/1956
;
run;