I have the following problem:
I want to fill missing values with proc expand be simply taking the value from the next data row.
My data looks like this:
date;index;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
As you can see for some dates the index is missing. I want to achieve the following:
date;index;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;-1688
05.Jul09;-1688
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;-1683
12.Jul09;-1683
13.Jul09;-1683
As you can see the values for the missing data where taken from the next row (11.Jul09 and 12Jul09 got the value from 13Jul09)
So proc expand seems to be the right approach and i started using this code:
PROC EXPAND DATA=DUMMY
OUT=WORK.DUMMY_TS
FROM = DAY
ALIGN = BEGINNING
METHOD = STEP
OBSERVED = (BEGINNING, BEGINNING);
ID date;
CONVERT index /;
RUN;
QUIT;
This filled the gaps but from the previous row and whatever I set for ALIGN, OBSERVED or even sorting the data descending I do not achieve the behavior I want.
If you know how to make it right it would be great if you could give me a hint. Good papers on proc expand are apprechiated as well.
Thanks for your help and kind regards
Stephan
I don't know about proc expand. But apparently this can be done with a few steps.
Read the dataset and create a new variable that will get the value of n.
data have;
set have;
pos = _n_;
run;
Sort this dataset by this new variable, in descending order.
proc sort data=have;
by descending pos;
run;
Use Lag or retain to fill the missing values from the "next" row (After sorting, the order will be reversed).
data want;
set have (rename=(index=index_old));
retain index;
if not missing(index_old) then index = index_old;
run;
Sort back if needed.
proc sort data=want;
by pos;
run;
I'm no PROC EXPAND expert but this is what I came up with. Create LEADS for the maximum gap run (2) then coalesce them into INDEX.
data index;
infile cards dsd dlm=';';
input date:date11. index;
format date date11.;
cards4;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
;;;;
run;
proc print;
run;
PROC EXPAND DATA=index OUT=index2 method=none;
ID date;
convert index=lead1 / transform=(lead 1);
CONVERT index=lead2 / transform=(lead 2);
RUN;
QUIT;
proc print;
run;
data index3;
set index2;
pocb = coalesce(index,lead1,lead2);
run;
proc print;
run;
Modified to work for any reasonable gap size.
data index;
infile cards dsd dlm=';';
input date:date11. index;
format date date11.;
cards4;
27.Jun09;
28.Jun09;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
14.Jul09;
15.Jul09;
16.Jul09;
17.Jul09;-1694
;;;;
run;
proc print;
run;
/* find the largest gap */
data gapsize(keep=n);
set index;
by index notsorted;
if missing(index) then do;
if first.index then n=0;
n+1;
if last.index then output;
end;
run;
proc summary data=gapsize;
output out=maxgap(drop=_:) max(n)=maxgap;
run;
/* Gen the convert statement for LEADs */
filename FT67F001 temp;
data _null_;
file FT67F001;
set maxgap;
do i = 1 to maxgap;
put 'Convert index=lead' i ' / transform=(lead ' i ');';
end;
stop;
run;
proc expand data=index out=index2 method=none;
id date;
%inc ft67f001;
run;
quit;
data index3;
set index2;
pocb = coalesce(index,of lead:);
drop lead:;
run;
proc print;
run;
Related
I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I've to save a huge amount of data and not so much physical space and I should construct like 120 field (a0-a6,b0-b6 etc) that are the same variables just with fixed weight (wgt0-wgt5).
I want to store a dataset with 20 columns (a,b,c..) and 6 weight (wgt0-wgt5) and, on demand, processing a "summary" without an intermediate datastep that oblige me to create 120 fields.
Due to the huge amount of data (more or less 55Gb every month) I'd like also not to use proc sql statement:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
There is a "Super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');;
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the key;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate
with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;[enter image description here][1]
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
enter image description here
Essentially I need to reorder my data set.
The data consists of 4 columns, one for each treatment group. I'm trying to run a simple 1-way anova in SAS but I don't know how to reorder the data so that there are two columns, one with the responses and one with the treatments.
Here is some example code to create example data sets.
data have;
input A B C D;
cards;
26.72 37.42 11.23 44.33
28.58 56.46 29.63 76.86
29.71 51.91 . .
;
run;
data want;
input Response Treatment $;
cards;
26.72 A
28.58 A
29.71 A
37.42 B
56.46 B
51.91 B
11.23 C
29.63 C
44.33 D
76.86 D
;
run;
I'm sure this is a very simple solution but I didn't see the same thing asked elsewhere on the site. I'm typically an R user but have to use SAS for this analysis so I could be looking for the wrong keywords.
I have used proc transpose for this, see below
/*1. create row numbers for each obs*/
data have1;
set have;
if _n_=1 then row=1;
else row+1;
run;
proc sort data=have1; by row; run;
/*Transpose the dataset by row number*/
proc transpose data=have1 out=want0;
by row;
run;
/*Final dataset by removing missing values*/
data want;
set want0;
drop row;
if COL1=. then delete;
rename _NAME_=Response
COL1=Treatment;
run;
proc sort data=want; by Response; run;
proc print data=want; run;
If that is your data for which you must use SAS just read so that you get the structure required for ANOVA.
data have;
do rep=1 to 3;
do trt='A','B','C','D';
input y #;
output;
end;
end;
cards;
26.72 37.42 11.23 44.33
28.58 56.46 29.63 76.86
29.71 51.91 . .
;;;;
run;
proc print;
run;
I'm using SAS and I'd like to create an indicator variable.
The data I have is like this (DATA I HAVE):
and I want to change this to (DATA I WANT):
I have a fixed number of total time that I want to use, and the starttime has duplicate time value (in this example, c1 and c2 both started at time 3). Although the example I'm using is small with 5 names and 12 time values, the actual data is very large (about 40,000 names and 100,000 time values - so the outcome I want is a matrix with 100,000x40,000.)
Can someone please provide any tips/solution on how to handle this?
40k variables is a lot. It will be interesting to see how well this scales. How do you determine the stop time?
data have;
input starttime name :$32.;
retain one 1;
cards;
1 varx
3 c1
3 c2
5 c3x
10 c4
11 c5
;;;;
run;
proc print;
run;
proc transpose data=have out=have2(drop=_name_ rename=(starttime=time));
by starttime;
id name;
var one;
run;
data time;
if 0 then set have2(drop=time);
array _n[*] _all_;
retain _n 0;
do time=.,1 to 12;
output;
call missing(of _n[*]);
end;
run;
data want0 / view=want0;
merge time have2;
by time;
retain dummy '1';
run;
data want;
length time 8;
update want0(obs=0) want0;
by dummy;
if not missing(time);
output;
drop dummy;
run;
proc print;
run;
This will work. There may be a simpler solution that does it all in one data step. My data step creates a staggered results that has to be collapsed which I do by summing in the sort/means.
data have;
input starttime name $;
datalines;
3 c1
3 c2
5 c3
10 c4
11 c5
;
run;
data want(drop=starttime name);
set have;
array cols (*) c1-c5;
do time=1 to 100;
if starttime < time then cols(_N_)=1;
else cols(_N_)=0;
output;
end;
run;
proc sort data=want;
by time;
proc means data=want noprint;
by time;
var _numeric_;
output out=want2(drop=_type_ _freq_) sum=;
run;
I am not recommending you do it this way. You didn't provide enough information to let us know why you want a matrix of that size. You may have processing issues getting it to run.
In the line do time=1 to 100 you can change that to 100000 or whatever length.
I think the code below will work:
%macro answer_macro(data_in, data_out);
/* Deduplication of initial dataset just to assure that every variable has a unique starting time*/
proc sort data=&data_in. out=data_have_nodup; by name starttime; run;
proc sort data=data_have_nodup nodupkey; by name; run;
/*Getting min and max starttime values - here I am assuming that there is only integer values form starttime*/
proc sql noprint;
select min(starttime)
,max(starttime)
into :min_starttime /*not used. Use this (and change the loop on the next dataset) to start the time variable from the value where the first variable starts*/
,:max_starttime
from data_have_nodup
;quit;
/*Getting all pairs of name/starttime*/
proc sql noprint;
select name
,starttime
into :name1 - :name1000000
,:time1 - :time1000000
from data_have_nodup
;quit;
/*Getting total number of variables*/
proc sql noprint;
select count(*) into :nvars
from data_have_nodup
;quit;
/* Creating dataset with possible start values */
/*I'm not sure this step could be done with a single datastep, but I don't have SAS
on my PC to make tests, so I used the method below*/
data &data_out.;
do i = 1 to &max_starttime. + 1;
time = i; output;
end;
drop i;
run;
data &data_out.;
set &data_out.;
%do i = 1 %to &nvars.;
if time >= &&time&i then &&name&i = 1;
else &&name&i = 0;
%end;
run;
%mend answer_macro;
Unfortunately I don't have SAS on my machine right now, so I can't confirm that the code works. But even if it doesn't, you can use the logic in it.
Suppose I have these data read into SAS:
I would like to list each unique name and the number of months it appeared in the data above to give a data set like this:
I have looked into PROC FREQ, but I think I need to do this in a DATA step, because I would like to be able to create other variables within the new data set and otherwise be able to manipulate the new data.
Data step:
proc sort data=have;
by name month;
run;
data want;
set have;
by name month;
m=month(lag(month));
if first.id then months=1;
else if month(date)^=m then months+1;
if last.id then output;
keep name months;
run;
Pro Sql:
proc sql;
select distinct name,count(distinct(month(month))) as months from have group by name;
quit;
While it's possible to do this in a data step, you wouldn't; you'd use proc freq or similar. Almost every PROC can give you an output dataset (rather than just print to the screen).
PROC FREQ data=sashelp.class;
tables age/out=age_counts noprint;
run;
Then you can use this output dataset (age_counts) as a SET input to another data step to perform your further calculations.
You can also use proc sql to group the variable and count how many are in that group. It might be faster than proc freq depending on how large your data is.
proc sql noprint;
create table counts as
select AGE, count(*) as AGE_CT from sashelp.class
group by AGE;
quit;
If you want to do it in a data step, you can use a Hash Object to hold the counted values:
data have;
do i=1 to 100;
do V = 'a', 'b', 'c';
output;
end;
end;
run;
data _null_;
set have end=last;
if _n_ = 1 then do;
declare hash cnt();
rc = cnt.definekey('v');
rc = cnt.definedata('v','v_cnt');
rc = cnt.definedone();
call missing(v_cnt);
end;
rc = cnt.find();
if rc then do;
v_cnt = 1;
cnt.add();
end;
else do;
v_cnt = v_cnt + 1;
cnt.replace();
end;
if last then
rc = cnt.output(dataset: "want");
run;
This is very efficient as it is a single loop over the data. The WANT data set contains the key and count values.
I need help with proc transpose procedure in SAS. My code initially was:
proc transpose data=temp out=temp1;
by patid;
var text;
Id datanumber;
run;
This gave me error "The ID value " " occurs twice in the same BY group". I modified the code to this:
proc sort data = temp;
by patid text datanumber;
run;
data temp;
set temp by patid text datanumber;
if first.datanunmber then n = 0;
n+1;
run;
proc sort data = temp;
by patid text datanumber n;
run;
proc transpose out=temp1 (drop=n) let;
by patid;
var text;
id datanumber;
run;
This is giving me error: variable n is not recognized. Adding a let option is giving a lot of error "occurs twice in the same BY group". I want to keep all id values.
Please help me in this.
Data Example:
Patid Text
When you get that error it is telling you that you have multiple data points for one or more variables that you are trying to create. SAS can force the transpose and delete the extra datapoints if you add "let" to the proc transpose line.
Your data is possibly not unique? I created a dataset (with unique values of patid and datanumber) and the transpose works:
data temp (drop=x y);
do x=1 to 4;
PATID='PATID'||left(x);
do y=1 to 3;
DATANUMBER='DATA'||left(y);
TEXT='TEXT'||left(x*y);
output;
end;
end;
proc sort; by _all_;
proc transpose out=temp2 (drop=_name_);
by patid;
var text;
id datanumber;
run;
my recommendation would be to forget the 'n' fix and focus on making the data unique for patid and datanumber, a dirty approach would be:
proc sort data = temp nodupkey;
by patid datanumber;
run;
at the start of your code..
Try to sort your dataset by patid text n datanumber, (n before datanumber).
Try to sort your dataset by patid n datanumber, (n before datanumber). and proc transpose "by patib n ";