proc transpose with duplicate ID values - sas

I need help with proc transpose procedure in SAS. My code initially was:
proc transpose data=temp out=temp1;
by patid;
var text;
Id datanumber;
run;
This gave me error "The ID value " " occurs twice in the same BY group". I modified the code to this:
proc sort data = temp;
by patid text datanumber;
run;
data temp;
set temp by patid text datanumber;
if first.datanunmber then n = 0;
n+1;
run;
proc sort data = temp;
by patid text datanumber n;
run;
proc transpose out=temp1 (drop=n) let;
by patid;
var text;
id datanumber;
run;
This is giving me error: variable n is not recognized. Adding a let option is giving a lot of error "occurs twice in the same BY group". I want to keep all id values.
Please help me in this.
Data Example:
Patid Text

When you get that error it is telling you that you have multiple data points for one or more variables that you are trying to create. SAS can force the transpose and delete the extra datapoints if you add "let" to the proc transpose line.

Your data is possibly not unique? I created a dataset (with unique values of patid and datanumber) and the transpose works:
data temp (drop=x y);
do x=1 to 4;
PATID='PATID'||left(x);
do y=1 to 3;
DATANUMBER='DATA'||left(y);
TEXT='TEXT'||left(x*y);
output;
end;
end;
proc sort; by _all_;
proc transpose out=temp2 (drop=_name_);
by patid;
var text;
id datanumber;
run;
my recommendation would be to forget the 'n' fix and focus on making the data unique for patid and datanumber, a dirty approach would be:
proc sort data = temp nodupkey;
by patid datanumber;
run;
at the start of your code..

Try to sort your dataset by patid text n datanumber, (n before datanumber).

Try to sort your dataset by patid n datanumber, (n before datanumber). and proc transpose "by patib n ";

Related

SaS 9.4: How to use different weights on the same variable without datastep or proc sql

I can't find a way to summarize the same variable using different weights.
I try to explain it with an example (of 3 records):
data pippo;
a=10;
wgt1=0.5;
wgt2=1;
wgt3=0;
output;
a=3;
wgt1=0;
wgt2=0;
wgt3=1;
output;
a=8.9;
wgt1=1.2;
wgt2=0.3;
wgt3=0.1;
output;
run;
I tried the following:
proc summary data=pippo missing nway;
var a /weight=wgt1;
var a /weight=wgt2;
var a /weight=wgt3;
output out=pluto (drop=_freq_ _type_) sum()=;
run;
Obviously it gives me a warning because I used the same variable "a" (I can't rename it!).
I've to save a huge amount of data and not so much physical space and I should construct like 120 field (a0-a6,b0-b6 etc) that are the same variables just with fixed weight (wgt0-wgt5).
I want to store a dataset with 20 columns (a,b,c..) and 6 weight (wgt0-wgt5) and, on demand, processing a "summary" without an intermediate datastep that oblige me to create 120 fields.
Due to the huge amount of data (more or less 55Gb every month) I'd like also not to use proc sql statement:
proc sql;
create table pluto
as select sum(db.a * wgt1) as a0, sum(db.a * wgt1) as a1 , etc.
quit;
There is a "Super proc summary" that can summarize the same field with different weights?
Thanks in advance,
Paolo
I think there are a few options. One is the data step view that data_null_ mentions. Another is just running the proc summary however many times you have weights, and either using ods output with the persist=proc or 20 output datasets and then setting them together.
A third option, though, is to roll your own summarization. This is advantageous in that it only sees the data once - so it's faster. It's disadvantageous in that there's a bit of work involved and it's more complicated.
Here's an example of doing this with sashelp.baseball. In your actual case you'll want to use code to generate the array reference for the variables, and possibly for the weights, if they're not easily creatable using a variable list or similar. This assumes you have no CLASS variable, but it's easy to add that into the key if you do have a single (set of) class variable(s) that you want NWAY combinations of only.
data test;
set sashelp.baseball;
array w[5];
do _i = 1 to dim(w);
w[_i] = rand('Uniform')*100+50;
end;
output;
run;
data want;
set test end=eof;
i = .;
length varname $32;
sumval = 0 ;
sum=0;
if _n_ eq 1 then do;
declare hash h_summary(suminc:'sumval',keysum:'sum',ordered:'a');;
h_summary.defineKey('i','varname'); *also would use any CLASS variable in the key;
h_summary.defineData('i','varname'); *also would include any CLASS variable in the key;
h_summary.defineDone();
end;
array w[5]; *if weights are not named in easy fashion like this generate this with code;
array vars[*] nHits nHome nRuns; *generate this with code for the real dataset;
do i = 1 to dim(w);
do j = 1 to dim(vars);
varname = vname(vars[j]);
sumval = vars[j]*w[i];
rc = h_summary.ref();
if i=1 then put varname= sumval= vars[j]= w[i]=;
end;
end;
if eof then do;
rc = h_summary.output(dataset:'summary_output');
end;
run;
One other thing to mention though... if you're doing this because you're doing something like jackknife variance estimation or that sort of thing, or anything that uses replicate weights, consider using PROC SURVEYMEANS which can handle replicate weights for you.
You can SCORE your data set using a customized SCORE data set that you can generate
with a data step.
options center=0;
data pippo;
retain a 10 b 1.75 c 5 d 3 e 32;
run;
data score;
if 0 then set pippo;
array v[*] _numeric_;
retain _TYPE_ 'SCORE';
length _name_ $32;
array wt[3] _temporary_ (.5 1 .333);
do i = 1 to dim(v);
call missing(of v[*]);
do j = 1 to dim(wt);
_name_ = catx('_',vname(v[i]),'WGT',j);
v[i] = wt[j];
output;
end;
end;
drop i j;
run;
proc print;[enter image description here][1]
run;
proc score data=pippo score=score;
id a--e;
var a--e;
run;
proc print;
run;
proc means stackods sum;
ods exclude summary;
ods output summary=summary;
run;
proc print;
run;
enter image description here

Highlight the corresponding line number with another datasets

I have two datasets, one extract the extreme values from proc univariate. I would like to create a new variable and label them as 1 if the n in the original dataset equals the extracted line number in the univariate dataset. But I don't know how to program it not manually enter the line number.
 
There're a few ways to do this, but one easy way is to just add the rownum to the original dataset and merge on it.
Here's an example.
ods output extremeobs=extreme_test;
proc univariate data=sashelp.heart;
run;
ods output close;
data extreme_diastolic extreme_systolic; *just creating the extreme datasets;
set extreme_test;
if varname='Diastolic' then output extreme_diastolic;
else if varname='Systolic' then output extreme_systolic;
run;
data for_merge; *adding rownum on to the original dataset;
set sashelp.heart;
rownum = _n_;
run;
*now, sort the extreme datasets by the `highobs` and `lowobs` values respectively and save those as `rownum`, so they can be merged;
proc sort data=extreme_diastolic out=high_diastolic(keep=highobs rename=highobs=rownum);
by highobs;
run;
proc sort data=extreme_systolic out=high_systolic(keep=highobs rename=highobs=rownum);
by highobs;
run;
proc sort data=extreme_diastolic out=low_diastolic(keep=lowobs rename=lowobs=rownum);
by lowobs;
run;
proc sort data=extreme_systolic out=low_systolic(keep=lowobs rename=lowobs=rownum);
by lowobs;
run;
*now, merge those on using `in=` to identify which are matches.;
data heart_extremes;
merge for_merge high_diastolic(in=_highd) high_systolic(in=_highs) low_diastolic(in=_lowd) low_systolic(in=_lows);
by rownum;
if _highd then high_diastolic = 1;
if _highs then high_systolic = 1;
if _lowd then low_diastolic = 1;
if _lows then low_systolic = 1;
run;

Covert wide to long in sas when all the variable has the suffix needed

I want the first wide dataset to be as the second long datafile, I have thought about using array, but considering I have 100 variables (the example only have 2), do I need 100 arrays?
Could you let me know how to do?
Use a double transpose. First transpose to a tall structure. Then split the name into the basename and time. Then transpose again. Here is untested code since no example data was provided (only photographs).
proc transpose data=have out=tall ;
by id;
var _numeric_;
run;
data fixed ;
set tall ;
time = scan(_name_,-1,'_');
_name_ = substr(_name_,1,length(_name_)-length(time)-1);
run;
proc sort data=fixed ;
by id time;
run;
proc transpose data=fixed out=want ;
by id time ;
id _name_;
var col1;
run;

Transpose the dataset

I have the original dataset
What I want is:
My code:
proc transpose data = lib.original
out= lib.new(rename=(col1=Mean col2=Median));
var WBCmean RBCmean WBCmedian RBCmedian;
run;
But I get
Can you give some hint?
EDIT
If I add by statement,
proc transpose data = lib.original
out= lib.new;
by Gender;
var WBCmean RBCmean WBCmedian RBCmedian;
run;
then I get
One way is to convert to a tall skinny format.
proc transpose data = have out=middle ;
by gender ;
var WBCmean RBCmean WBCmedian RBCmedian;
run;
Then generate the new variables needed,
data middle;
set middle ;
testcode = substr(_name_,1,3);
_name_ = substr(_name_,4);
run;
you might need to sort the data depending on the order of variable names in your original VAR statement,
proc sort data=middle;
by gender testcode _name_;
run;
and then transpose into the new arrangement.
proc transpose data=middle out=want ;
by gender testcode ;
id _name_;
var col1 ;
run;

Use SAS proc expand for filling missing values

I have the following problem:
I want to fill missing values with proc expand be simply taking the value from the next data row.
My data looks like this:
date;index;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
As you can see for some dates the index is missing. I want to achieve the following:
date;index;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;-1688
05.Jul09;-1688
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;-1683
12.Jul09;-1683
13.Jul09;-1683
As you can see the values for the missing data where taken from the next row (11.Jul09 and 12Jul09 got the value from 13Jul09)
So proc expand seems to be the right approach and i started using this code:
PROC EXPAND DATA=DUMMY
OUT=WORK.DUMMY_TS
FROM = DAY
ALIGN = BEGINNING
METHOD = STEP
OBSERVED = (BEGINNING, BEGINNING);
ID date;
CONVERT index /;
RUN;
QUIT;
This filled the gaps but from the previous row and whatever I set for ALIGN, OBSERVED or even sorting the data descending I do not achieve the behavior I want.
If you know how to make it right it would be great if you could give me a hint. Good papers on proc expand are apprechiated as well.
Thanks for your help and kind regards
Stephan
I don't know about proc expand. But apparently this can be done with a few steps.
Read the dataset and create a new variable that will get the value of n.
data have;
set have;
pos = _n_;
run;
Sort this dataset by this new variable, in descending order.
proc sort data=have;
by descending pos;
run;
Use Lag or retain to fill the missing values from the "next" row (After sorting, the order will be reversed).
data want;
set have (rename=(index=index_old));
retain index;
if not missing(index_old) then index = index_old;
run;
Sort back if needed.
proc sort data=want;
by pos;
run;
I'm no PROC EXPAND expert but this is what I came up with. Create LEADS for the maximum gap run (2) then coalesce them into INDEX.
data index;
infile cards dsd dlm=';';
input date:date11. index;
format date date11.;
cards4;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
;;;;
run;
proc print;
run;
PROC EXPAND DATA=index OUT=index2 method=none;
ID date;
convert index=lead1 / transform=(lead 1);
CONVERT index=lead2 / transform=(lead 2);
RUN;
QUIT;
proc print;
run;
data index3;
set index2;
pocb = coalesce(index,lead1,lead2);
run;
proc print;
run;
Modified to work for any reasonable gap size.
data index;
infile cards dsd dlm=';';
input date:date11. index;
format date date11.;
cards4;
27.Jun09;
28.Jun09;
29.Jun09;-1693
30.Jun09;-1692
01.Jul09;-1691
02.Jul09;-1690
03.Jul09;-1689
04.Jul09;.
05.Jul09;.
06.Jul09;-1688
07.Jul09;-1687
08.Jul09;-1686
09.Jul09;-1685
10.Jul09;-1684
11.Jul09;.
12.Jul09;.
13.Jul09;-1683
14.Jul09;
15.Jul09;
16.Jul09;
17.Jul09;-1694
;;;;
run;
proc print;
run;
/* find the largest gap */
data gapsize(keep=n);
set index;
by index notsorted;
if missing(index) then do;
if first.index then n=0;
n+1;
if last.index then output;
end;
run;
proc summary data=gapsize;
output out=maxgap(drop=_:) max(n)=maxgap;
run;
/* Gen the convert statement for LEADs */
filename FT67F001 temp;
data _null_;
file FT67F001;
set maxgap;
do i = 1 to maxgap;
put 'Convert index=lead' i ' / transform=(lead ' i ');';
end;
stop;
run;
proc expand data=index out=index2 method=none;
id date;
%inc ft67f001;
run;
quit;
data index3;
set index2;
pocb = coalesce(index,of lead:);
drop lead:;
run;
proc print;
run;