I have been trying to merge more than two datasets in SAS, but the output isn't what I expected. The original dataset is:
data test;
input id days value date:date9.;
format date date9.;
datalines;
128330 150 3903053 01jul2016
;
run;
The datasets to merge:
data base1;
input id days1 value1 date1:date9.;
format date1 date9.;
datalines;
128330 0 3849050 01jun2015
128330 0 3827305 01jul2015
128330 0 3822779 01aug2015
128330 30 3771383 01feb2015
128330 0 3756117 01jan2015
128330 0 . 01nov2015
128330 0 3818253 01sep2015
128330 0 . 01oct2015
128330 90 3668595 01may2015
128330 60 3705683 01apr2015
128330 30 3690417 01mar2015
128330 0 3802639 01dec2015
;
run;
data base2;
input id days1 value1 date1:ddmmyy10.;
format date1 date9.;
datalines;
128330 0 3805129 1/01/2016
128330 0 3887603 1/02/2016
128330 30 3890093 1/03/2016
128330 60 3892583 1/04/2016
128330 90 3896073 1/05/2016
128330 120 3899563 1/06/2016
128330 150 3903053 1/07/2016
128330 180 3906543 1/08/2016
128330 210 3906543 1/09/2016
128330 240 3906543 1/10/2016
128330 270 3906543 1/11/2016
128330 300 3906543 1/12/2016
;
run;
data base3;
input id days1 value1 date1:ddmmyy10.;
format date1 date9.;
datalines;
128330 330 3906543 1/01/2017
128330 360 3906543 1/02/2017
128330 390 3906543 1/03/2017
128330 420 3906543 1/04/2017
128330 450 3906543 1/05/2017
128330 480 3906543 1/06/2017
128330 510 3906543 1/07/2017
128330 540 3906543 1/08/2017
128330 570 3906543 1/09/2017
128330 600 3906543 1/10/2017
128330 630 3906543 1/11/2017
;
run;
The merge:
data merge1;
merge test(in=info) base1 base2 base3;
by id;
if info;
run;
The output has just twelve rows, the ones that belong to base3:
I expected:
I need a merge because I work with datasets that have thousands of records, and I need to merge them by id.
Thanks for your help.
You cannot MERGE those datasets since you have more than one dataset with multiple observations per ID. In that case the MERGE will match the observations in order and values will get overwritten. Note that if your BASEx datasets each had distinct sets of ID values such that you never had observations from one BASEx dataset merging with observations from a different BASEx then it could actually work.
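The overwriting can be seen in a minimal sketch. The dataset names a, b and ab and their values are made up for illustration and are not part of the question:

```sas
/* Hypothetical toy data: both datasets repeat id=1 and share the variable x */
data a;
  input id x;
  datalines;
1 10
1 20
;
run;

data b;
  input id x;
  datalines;
1 100
1 200
1 300
;
run;

/* Observations are paired in order within the BY group;            */
/* for the shared variable x, the value read last (from b) is kept. */
data ab;
  merge a b;
  by id;
run;

proc print data=ab;
run;
```

The result should have three rows with x equal to 100, 200 and 300; the values 10 and 20 from a are lost. The log should also carry a note that the MERGE statement has more than one dataset with repeats of BY values.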
So you have three datasets with the same structure, each of which can have multiple observations per id. Concatenate those first, and then merge the result with the other dataset, which has only one observation per id.
data base ;
set base1-base3 ;
by id;
run;
data want ;
merge test base ;
by id;
run;
You could make the first data step generate a view instead of a dataset if you are worried about disk space.
data base / view=base ;
set base1-base3 ;
by id;
run;
I'm pretty new in SAS, so I'm struggling to find out how to rearrange my data. My data set looks like this:
CPT DATE A B C D etc.
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
etc.
where cpt(i) represents each counterparty, date(i) represents the date of my cash flows and A,B,C,D are the different types of cash flows. Since this dataset has lots of columns, I'd like to rearrange the data by increasing the number of rows when there is more than one cash flow in date(i). So the output is supposed to be this one:
CPT DATE Cash Flow Type
1 date1 20.000 A
1 date1 5.000 B
1 date2 30.000 D
1 date3 10.000 B
1 date3 10.000 C
2 date1 3.000 A
2 date1 3.000 B
2 date2 5.000 C
2 date2 3.000 D
etc.
Any tips on how to get what I want? Cheers
Datalines format of data is below.
data have;
input CPT DATE$ A B C D;
format a b c d 8.3;
datalines;
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
;
run;
This is a 'wide to long' transpose. It's really easy!
data have;
input CPT DATE $ A B C D ;
datalines;
1 date1 20.000 5.000 0 0
1 date2 0 0 0 30.000
1 date3 0 10.000 10.000 0
2 date1 3.000 3.000 0 0
2 date2 0 0 5.000 3.000
;;;;
run;
proc transpose data=have out=want;
by cpt date;
var a b c d;
run;
If there are more complexities than this, you can also do this in the data step.
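A minimal sketch of the data step route, assuming the have dataset created above. The names want_long, flows, cash_flow and type are chosen here for illustration only:

```sas
/* Wide-to-long with an array: one output row per non-zero cash flow */
data want_long(keep=cpt date cash_flow type);
  set have;
  array flows{*} a b c d;          /* the cash flow columns to unpivot */
  length type $32;
  do i = 1 to dim(flows);
    if flows{i} ne 0 then do;      /* skip dates with no flow of this type */
      cash_flow = flows{i};
      type = vname(flows{i});      /* name of the source column */
      output;
    end;
  end;
run;
```

The data step makes it easy to add further conditions inside the loop, which is where it can be more convenient than PROC TRANSPOSE.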
Use proc transpose. It's the easiest way to transpose data in SAS. It automatically names the transposed columns COL1, COL2, etc. Use the rename= output dataset option to rename COL1 to cash_flow.
proc transpose data = have
out = want(rename=(COL1 = cash_flow) )
name = type
;
by cpt date;
run;
A more tricked out TRANSPOSE can set the pivot column label and restrict the output to non-zero cashflow.
proc transpose data=have
out=want(
rename=(_name_=Type col1=cashflow)
where=(cashflow ne 0)
)
;
by cpt date;
var a b c d;
label cashflow='Cash Flow';
run;
You will have to endure a log message
WARNING: Variable CASHFLOW not found in data set WORK.HAVE.
I have this horizontal data:
Placebo 0.90 0.37 1.63 0.83 0.95 0.78 0.86 0.61 0.38 1.97
Alcohol 1.46 1.45 1.76 1.44 1.11 3.07 0.98 1.27 2.56 1.32
But I want it to be vertical:
Placebo Alcohol
0.90 1.46
0.37 1.45
... ...
I successfully read and transposed similar data this way, but I'm searching for a more elegant solution that does the same thing without creating 2 unnecessary datasets:
data female;
input cost_female :comma. @@;
datalines;
871 684 795 838 1,033 917 1,047 723 1,179 707 817 846 975 868 1,323 791 1,157 932 1,089 770
;
data male;
input cost_male :comma. @@;
datalines;
792 765 511 520 618 447 548 720 899 788 927 657 851 702 918 528 884 702 839 878
;
data repair_costs;
merge female male;
run;
You can use proc transpose to do the same.
data have;
input medicine :$7. a1-a10;
datalines;
Placebo 0.90 0.37 1.63 0.83 0.95 0.78 0.86 0.61 0.38 1.97
Alcohol 1.46 1.45 1.76 1.44 1.11 3.07 0.98 1.27 2.56 1.32
;
run;
proc transpose data=have out=want(drop=_name_);
id medicine;
var a1-a10;
run;
Let me know in case of any doubts.
For arbitrarily wide input data you will have to use binary mode input, which is specified with RECFM=N.
This sample code creates a wide data file in transposed form. Thus the data file has one row per final dataset column and one column per final dataset row.
The code presumes CRLF line termination and tests for it explicitly. The input data set is reshaped using a single Proc TRANSPOSE.
filename flipflop 'c:\temp\rowdata-across.txt';
%let NUM_ROWS = 10000; * thus 10,000 columns of data in flipflop;
%let NUM_COLS = 30;
* simulate input data where row data is across a line of arbitrary length (that means > 32K);
* recfm=n means binary mode output, hence no LRECL limit;
data _null_;
file flipflop recfm=n;
do colindex = 1 to &NUM_COLS;
put 'column' +(-1) colindex @; * first column of output data is column name;
do rowindex=1 to &NUM_ROWS;
value = (rowindex-1) * 10 ** floor(log10(&NUM_COLS)) * 10 + colindex;
put value @; * data for rows goes across;
end;
put '0d0a'x;
end;
run;
* recfm=n means binary mode input, hence no LRECL limit;
* as filesize increases, binary mode will become slower than <32K line orientated input;
data flipflop(keep=id rowseq colseq value);
length id $32 value 8;
infile flipflop unbuffered recfm=n col=p;
colseq+1;
input id +(-1);
do rowseq=1 by 1;
input value;
output;
input test $char2.;
if test = '0d0a'x then leave;
input @+(-2);
end;
run;
proc sort data=flipflop;
by rowseq colseq;
run;
proc transpose data=flipflop out=want(drop=_name_ rowseq);
by rowseq;
id id;
var value;
run;
There might be a way to speed up reading larger (say, a file with dataline width > 32k) files in binary mode, but I have not investigated such.
Other variations could utilize a hash object, however, the entire data set would have to fit in memory.
Below is a sample of my dataset:
City Days
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
I would like to break down "Days" into buckets of 0-30, 30-90, 90-180, 180+, such that the "buckets" are along the x-axis of the table, and the cities are along the y-axis.
I tried using PROC FREQ, but I don't have SAS/STAT. Is there any way to do this in base SAS?
I believe this is what you want. This is most certainly a "brute force" approach, but I think it outlines the concept correctly.
data have;
length city $9;
input city dayscount;
cards;
Atlanta 10
Tampa 95
Atlanta 100
Charlotte 20
Charlotte 31
Tampa 185
;
run;
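options validvarname=any; * needed for name literals such as '0-30'n, unless already set;
data want;
set have;
* half-open ranges, so a boundary value such as 30 lands in exactly one bucket;
if 0 <= dayscount < 30 then '0-30'n = dayscount;
else if 30 <= dayscount < 90 then '30-90'n = dayscount;
else if 90 <= dayscount < 180 then '90-180'n = dayscount;
else if dayscount >= 180 then '180+'n = dayscount;
drop dayscount;
run;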
One of the ways for solving this problem is by using Proc Format for assigning the value bucket and then using Proc Transpose for the desired result:
data city_day_split;
length city $12.;
input city dayscount;
cards;
atlanta 10
tampa 95
atlanta 100
charlotte 20
charlotte 31
tampa 185
;
run;
/****Assigning the buckets****/
proc format;
value buckets
0 - <30 = '0-30'
30 - <90 = '30-90'
90 - <180 = '90-180'
180 - high = 'gte180'
;
run;
data city_day_split;
set city_day_split;
day_bucket = put(dayscount,buckets.);
run;
proc sort data=city_day_split out=city_day_split;
by city;
run;
/****Making the Buckets as columns, City as rows and daycount as Value****/
proc transpose data=city_day_split out=city_day_split_1(drop=_name_);
by city;
id day_bucket;
var dayscount;
run;
My Output:
> **city |0-30 |90-180 |30-90 |GTE180**
> Atlanta |10 |100 |. |.
> Charlotte |20 |. |31 |.
> Tampa |. |95 |. |185
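If a count per bucket is wanted rather than the raw values, PROC TABULATE (part of Base SAS) can apply a format to a class variable directly. This is only a sketch: it assumes the city_day_split dataset from above, and the format name daybkt is made up here:

```sas
/* Hypothetical bucket format, same half-open ranges as above */
proc format;
  value daybkt
    0 - <30 = '0-30'
    30 - <90 = '30-90'
    90 - <180 = '90-180'
    180 - high = '180+'
  ;
run;

/* Cities down the side, buckets across the top, cell = number of rows */
proc tabulate data=city_day_split;
  class city dayscount;
  format dayscount daybkt.;   /* class levels are formed from the formatted values */
  table city, dayscount*n / misstext='0';
run;
```

This avoids the extra data step and the transpose, at the cost of producing a report rather than a dataset.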
I tried various date formats, but the output does not show any date. What could be the issue?
data c;
input age gender income color$ doj$;
format doj date9.;
datalines;
19 1 14000 W 14/07/1988
45 2 45000 b 15/09/1956
34 2 56000 y 14/09/1967
33 1 45000 b 14/02/1956
;
run;
You are mixing things up a bit.
The date formats are to be applied on numeric data, not on text data.
So you should not read in doj as $ (text), but as a date (so a date informat).
Try DDMMYY10. for doj on your input statement:
data c;
input age gender income color$ doj :ddmmyy10.;
format doj date9.;
datalines;
19 1 14000 W 14/07/1988
45 2 45000 b 15/09/1956
34 2 56000 y 14/09/1967
33 1 45000 b 14/02/1956
;
run;
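A small sketch of the distinction, using one of the dates from the question. The informat converts the text into a number, and the format only controls how that number is displayed:

```sas
data _null_;
  doj = input('14/07/1988', ddmmyy10.);  /* informat: text to numeric SAS date */
  put 'unformatted: ' doj;               /* number of days since 01JAN1960 */
  put 'formatted:   ' doj date9.;        /* same value shown as 14JUL1988 */
run;
```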
I have a data set where a patient can have multiple (and unknown) values for some variables that ends up looking something like this:
ID Var1 Var2 Var3 Var4
1 Blue Female 17 908
1 Blue Female 17 909
1 Red Female 17 910
1 Red Female 17 911
...
99 Blue Female 14 908
100 Red Male 28 911
I want to pack this data down so that each ID has only a single entry, with indicators for the presence or absence of one of the values in their original slew of entries. So, for example, something like this:
ID YesBlue Var2 Var3 Yes911
1 1 Female 17 1
99 1 Female 14 0
100 0 Male 28 1
Is there a straightforward way to do this in SAS? Or failing that, in Access (where the data is coming from) which I have no idea really how to use.
If your data set is called PATIENTS1, maybe something like this:
proc sql noprint;
create table patients2 as
select *
,case(var1)
when "Blue" then 1
else 0
end as ablue
,case(var4)
when 911 then 1
else 0
end as a911
,max(calculated ablue) as yesblue
,max(calculated a911) as yes911
from patients1
group by id
order by id;
quit;
proc sort data=patients2 out=patients3(drop=var1 var4 ablue a911) nodupkey;
by id;
run;
Here's a data step solution. I'm assuming that the values for Var2 and Var3 are always the same for a given ID.
data have;
input ID Var1 $ Var2 $ Var3 Var4;
cards;
1 Blue Female 17 908
1 Blue Female 17 909
1 Red Female 17 910
1 Red Female 17 911
99 Blue Female 14 908
100 Red Male 28 911
;
run;
data want (drop=Var1 Var4 _:);
set have;
by ID;
if first.ID then do;
_blue=0;
_911=0;
end;
_blue+(Var1='Blue');
_911+(Var4=911);
if last.ID then do;
YesBlue=(_blue>0);
Yes911=(_911>0);
output;
end;
run;
EDIT: Looks like the same thing Keith said, only written differently.
This should do it:
data test;
input id Var1 $ Var2 $ Var3 Var4;
datalines;
1 Blue Female 17 908
1 Blue Female 17 909
1 Red Female 17 910
1 Red Female 17 911
99 Blue Female 14 908
100 Red Male 28 911
;
run;
data flatten(drop=Var1 Var4);
set test;
retain YesBlue;
retain Yes911;
by id;
if first.id then do;
YesBlue = 0;
Yes911 = 0;
end;
if Var1 eq "Blue" then YesBlue = 1;
if Var4 eq 911 then Yes911 = 1;
if last.id then output;
run;
PROC SQL is perfect for things like this. This is similar to DavB's answer, but eliminates the additional sort:
data have;
input ID Var1 $ Var2 $ Var3 Var4;
cards;
1 Blue Female 17 908
1 Blue Female 17 909
1 Red Female 17 910
1 Red Female 17 911
99 Blue Female 14 908
100 Red Male 28 911
;
run;
proc sql;
create table want as
select ID
, max(case(var1)
when 'Blue'
then 1
else 0 end) as YesBlue
, max(var2) as Var2
, max(var3) as Var3
, max(case(var4)
when 911
then 1
else 0 end) as Yes911
from have
group by id
order by id;
quit;
It also safely reduces your original data by the ID variable, but at the risk of possible errors if the source is not exactly as you describe.