I have longitudinal data, and I wish to combine consecutive rows when the value of one variable is the same, updating the time variables so that the start and finish reflect the combined time period. In the end, only the combined rows and the unique rows should be kept.
Here is an example.
Data have:
Person  Start     Finish      Weight
A       1/1/1988  31/12/1988  78
A       1/1/1989  31/12/1989  78
A       1/1/1990  31/12/1990  78
A       1/1/1991  31/12/1991  81
A       1/1/1992  31/12/1992  82
A       1/1/1993  31/12/1993  82
B       1/1/1968  31/12/1968  56
B       1/1/1969  31/12/1969  55
B       1/1/1970  31/12/1970  55
Data want:
Person  Start     Finish      Weight
A       1/1/1988  31/12/1990  78
A       1/1/1991  31/12/1991  81
A       1/1/1992  31/12/1993  82
B       1/1/1968  31/12/1968  56
B       1/1/1969  31/12/1970  55
What would be the best way of doing this? Thank you all for your time!
The code below will produce your required results from the sample data. It relies on the data being correctly sorted by start and finish.
data want (keep=person start finish weight);
set have (rename= (start=original_start finish=original_finish)); * rename so that the names do not clash with the final variable names;
by person weight notsorted; * assumes that the data are sorted by START and FINISH dates;
retain start; * remembers this variable WHILE we read multiple rows;
format start finish ddmmyy10.;
if first.weight then start=original_start; * record the first START date of each combination of PERSON and WEIGHT;
if last.weight then do;
finish=original_finish; * record the last FINISH date;
output; * only output when we have read the last row for this combination of PERSON and WEIGHT;
end;
run;
If your real data has complications like overlapping time periods then you could use proc summary to get the result:
* use PROC SUMMARY to calculate minimum and maximum for START and FINISH. Only keep the minimum for START and the maximum for FINISH;
proc summary data=have nway;
class person weight;
var start finish;
output out=want2 (drop=_type_ _freq_ min_finish max_start) min=start min_finish max=max_start finish;
run;
Or, if you need to keep the rows in their original order, you can replace the CLASS statement (class person weight;) with a BY statement (by person weight notsorted;). This will cause issues, however, if the rows containing the same person and weight values are not all together in the dataset.
proc summary data=have nway;
by person weight notsorted;
var start finish;
output out=want2 (drop=_type_ _freq_ min_finish max_start) min=start min_finish max=max_start finish;
run;
Suppose the data values for a person contain a time ordered pattern such as
X X X Y X X
and you want 3 rows, 1 for each contiguous period of same valued data values.
You can use DOW processing to compute the start and finish of each contiguous group.
Example:
data have;
  * the colon modifier reads each date up to the next blank instead of a fixed 10 columns;
  input Person $ Start :ddmmyy10. Finish :ddmmyy10. Weight;
  format start finish ddmmyy10.;
datalines;
A 1/1/1988 31/12/1988 78
A 1/1/1989 31/12/1989 78
A 1/1/1990 31/12/1990 78
A 1/1/1991 31/12/1991 81
A 1/1/1992 31/12/1992 82
A 1/1/1993 31/12/1993 82
B 1/1/1968 31/12/1968 56
B 1/1/1969 31/12/1969 55
B 1/1/1970 31/12/1970 55
;
run;
proc sort data=have;
  by person start;
run;
data want(keep=person start finish weight);
  * each iteration of the data step reads one contiguous group of rows;
  do until (last.weight);
    set have;
    by person weight notsorted;
    if first.weight then gstart=start;
  end;
  * finish already holds the value from the last row of the group;
  start = gstart;
run;
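The sample data above never returns to an earlier weight, so it does not exercise the X X X Y X X case. Here is a small sketch that does, using the same DOW loop; the dataset names have_xy and want_xy and the data values are made up for illustration.

```sas
/* Hypothetical data: weight 78 appears in two separate periods */
data have_xy;
  input Person $ Start :ddmmyy10. Finish :ddmmyy10. Weight;
  format Start Finish ddmmyy10.;
  datalines;
A 1/1/1988 31/12/1988 78
A 1/1/1989 31/12/1989 78
A 1/1/1990 31/12/1990 78
A 1/1/1991 31/12/1991 81
A 1/1/1992 31/12/1992 78
A 1/1/1993 31/12/1993 78
;
run;

data want_xy(keep=Person Start Finish Weight);
  /* NOTSORTED starts a new BY group whenever Weight changes */
  do until (last.Weight);
    set have_xy;
    by Person Weight notsorted;
    if first.Weight then gstart = Start;
  end;
  Start = gstart;  /* Finish is already that of the last row read */
run;
```

This should give three rows for person A (1988 to 1990 at 78, 1991 at 81, 1992 to 1993 at 78), whereas a CLASS-based PROC SUMMARY would merge both 78 periods into one row.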
Related
During some data cleaning process, there is a need to compare the data between different rows. For example, if the rows have the same countryID and subjectID then keep the largest temperature:
CountryID SubjectID Temperature
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
In a case like this, I use the lag() function as follows.
proc sort data=table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
CountryID_lag = lag(CountryID);
SubjectID_lag = lag(SubjectID);
Temperature_lag = lag(Temperature);
if CountryID = CountryID_lag and SubjectID = SubjectID_lag then do;
if Temperature < Temperature_lag then delete;
end;
drop CountryID_lag SubjectID_lag Temperature_lag;
run;
The code above seems to work, but I would still like to know whether there is a better way to solve this kind of problem.
I think you are overcomplicating the task. You can use proc sql and the max function:
proc sql noprint;
create table table_laged as
select CountryID,SubjectID,max(Temperature) as Temperature
from table
group by CountryID,SubjectID;
quit;
I don't know if you want it that way, but your code keeps every row that ties for the highest temperature.
So when one subject has 2 1 3, it will keep 3. But when it has 1 4 3 4 4, it will keep 4 4 4. It is simpler to keep just the first row for each subject, which is the highest because of the descending sort order.
proc sort data = table;
by CountryID SubjectID descending Temperature;
run;
data table_laged;
set table;
by CountryID SubjectID;
if first.SubjectID;
run;
You can use the double DOW technique to:
1. Compute a measure over a group,
2. Apply the measure to items in the group.
The benefit of DOW looping is a single pass over the data set when incoming data is already grouped.
In this question, step 1 is to identify the row in the group with the first highest temperature, and step 2 is to select that row for output.
data want;
do _n_ = 1 by 1 until (last.SubjectId);
set have;
by CountryId SubjectId;
if temperature > _max_temp then do;
_max_temp = temperature;
_max_at_n = _n_;
end;
end;
do _n_ = 1 to _n_;
set have;
if _n_ = _max_at_n then OUTPUT;
end;
drop _:;
run;
The traditional procedural technique is Proc MEANS
data have;
input CountryID SubjectID Temperature;
datalines;
1001 501 36
1001 501 38
1001 510 37
1013 501 36
1013 501 39
1095 532 36
run;
proc means noprint data=have;
by countryid subjectid;
output out=want(drop=_:) max(temperature)=temperature;
run;
If the data is not ordered by CountryID and SubjectID going into the data step, a hash object can be used, or SQL as in @Aurieli's answer.
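As a minimal sketch of the hash-object route, assuming the HAVE dataset from the example above: the output name want_hash and the helper variable _new are made up for illustration, and this is one possible arrangement rather than the only way to do it.

```sas
data _null_;
  if 0 then set have;                    /* define the variables at compile time */
  declare hash h(ordered: 'a');          /* keyed lookup, so input order does not matter */
  h.defineKey('CountryID', 'SubjectID');
  h.defineData('CountryID', 'SubjectID', 'Temperature');
  h.defineDone();

  do until (eof);
    set have end=eof;
    _new = Temperature;                  /* remember the value on this row */
    if h.find() ne 0 then h.add();       /* key not seen yet: store this row */
    else if _new > Temperature then do;  /* find() loaded the stored maximum */
      Temperature = _new;
      h.replace();
    end;
  end;

  h.output(dataset: 'want_hash');        /* one row per CountryID/SubjectID */
  stop;
run;
```

Because the maximum is tracked per key inside the hash, no PROC SORT is needed beforehand; the ordered: 'a' option only affects the order of the rows written to want_hash.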
I have a dataset called stores. I want to extract total sales (Retail_Price), proportion of sales, and cumulative proportion of sales for each store in SAS.
Sample dataset: Stores
Date Store_Postcode Retail_Price month Distance
08/31/2013 CR7 8LE 470 8 7057.8
10/26/2013 CR7 8LE 640 10 7057.8
08/19/2013 CR7 8LE 500 8 7057.8
08/17/2013 E2 0RY 365 8 1702.2
09/22/2013 W4 3PH 395.5 12 2522
06/19/2013 W4 3PH 360.5 6 1280.9
11/15/2013 W10 6HQ 475 12 3213.5
06/20/2013 W10 6HQ 500 1 3213.5
09/18/2013 E7 8NW 315 9 2154.8
10/23/2013 E7 8NW 570 10 5777.9
11/18/2013 W10 6HQ 455 11 3213.5
08/21/2013 W10 6HQ 530 8 3213.5
Code I tried:
Proc sql;
Create table work.Top_sellers as
Select Store_postcode as Stores,
SUM(Retail_price) as Total_Sales,
Round((Retail_price/Sum(Retail_price)),0.01) as Proportion_of_sales
From work.stores
Group by Store_postcode
Order by total_sales;
Quit;
I've no idea on how to calculate cumulative variable in proc sql...
Please help me improve my code!!
Computing a cumulative result in SQL requires the data to have an explicit unique ordered key and the query involves a reflexive join with 'triangular' criteria for the cumulative aspect.
data have;
do id = 100 to 120;
sales = ceil (10 + 25 * ranuni(123));
output;
end;
run;
proc sql;
create table want as
select
have1.id
, have1.sales
, sum(have2.sales) as sales_cusum
from
have as have1
join
have as have2
on
have1.id >= have2.id /* 'triangle' criteria */
group by
have1.id, have1.sales
order by
have1.id
;
quit;
A second way is to recompute the cusum on a row-by-row basis with a correlated subquery:
proc sql;
create table want as
select have.id, have.sales,
/* alias h2 is used because INNER is an SQL keyword */
( select sum(h2.sales)
from have as h2
where h2.id <= have.id
)
as cusum
from
have;
quit;
I have changed my mind: a CDF is a different calculation.
Here's how to do this via a data step. First calculate the cumulative totals (I used a data step here, but you could use PROC EXPAND if you have SAS/ETS).
*sort demo data;
proc sort data=sashelp.shoes out=shoes;
by region sales;
run;
data cTotal last (keep = region cTotal);
set shoes;
by region;
*calculate running total;
if first.region then cTotal=0;
cTotal = cTotal + sales;
*output every record to cTotal, and also write the last record of each region (which holds the region total) to Last;
if last.region then output last;
output cTotal;
retain cTotal;
run;
*merge in results and calculate percentages;
data calcs;
merge cTotal Last (rename=cTotal=Total);
by region;
percent = cTotal/Total;
run;
If you need a more efficient solution, I'd try a DoW solution.
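As a rough sketch of what such a DoW solution might look like, assuming the same sashelp.shoes demo data: the first loop totals each region, and the second loop re-reads the same rows to build the running total and percentage. The output name calcs_dow is made up for illustration.

```sas
proc sort data=sashelp.shoes out=shoes;
  by region sales;
run;

data calcs_dow;
  /* loop 1: total sales for the current region */
  do until (last.region);
    set shoes;
    by region;
    total = sum(total, sales);
  end;
  /* loop 2: read the same rows again, accumulate and output each one */
  do until (last.region);
    set shoes;
    by region;
    cTotal = sum(cTotal, sales);
    percent = cTotal / total;
    output;
  end;
run;
```

Neither total nor cTotal is retained, so both start as missing at the top of each data step iteration, which here means once per region. This avoids the separate Last dataset and the merge step.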
I am trying to match the maximum daily value within each month to monthly data.
data daily;
input permno $ date ret;
datalines;
1000 19860101 88
1000 19860102 90
1000 19860201 70
1000 19860202 55
1001 19860201 97
1001 19860202 74
1001 19860203 79
1002 19860301 55
1002 19860302 100
1002 19860301 10
;
run;
data monthly;
input permno $ date ret;
datalines;
1000 19860131 1
1000 19860228 2
1000 19860331 5
1001 19860331 3
1002 19860430 4
;
run;
The result I want is the following (I want to match each month's daily maximum to the monthly data for the following month, i.e. with a one-month lag):
1000 19860102 90 1000 19860228 2
1000 19860201 70 1000 19860331 5
1001 19860201 97 1001 19860331 3
1002 19860302 100 1002 19860430 4
Below is what I have tried so far.
I want the maximum ret value within each month, so I created yrmon to give all daily rows in the same month the same yyyymm value:
data a1; set daily;
yrmon=year(date)*100 + month(date);
run;
In order to choose the maximum value(here, ret) within same yrmon group for the same permno, I used code below
proc means data=a1 noprint;
class permno yrmon ;
var ret;
output out= a2 max=maxret;
run;
However, this only gave me permno, yrmon and the maximum ret, and dropped the original date. I then tried to shift yrmon forward by one month:
data a3;
set a2;
new=intnx('month',yrmon,1);
format date new yymmn6.;
run;
But this does not work, since yrmon is a plain number rather than a SAS date value.
Thank you in advance.
Hello,
I am trying to match two different datasets by permno (same company) but with a one-month lag (e.g. yrmon=198601 in the daily9 dataset and yrmon=198602 in the monthly2 dataset).
This is difficult for me to handle, because if I simply add 1 to yrmon, 198612 + 1 does not give 198701, and I am not sure how to deal with this.
Can anyone help?
1) informat date1/date2 yymmn6. is used to read the date in yyyymm format
2) format date1/date2 yymmn6. is used to view the date in yyyymm format
3) intnx("months",b.date2,-1) is used to join the dates with lag of 1 month
data data1;
input date1 value1;
informat date1 yymmn6.;
format date1 yymmn6.;
cards;
200101 200
200212 300
200211 400
;
run;
data data2;
input date2 value2;
informat date2 yymmn6.;
format date2 yymmn6.;
cards;
200101 3000000
200102 4000000
200301 2000000
200212 2000000
;
run;
proc sql;
create table result as
select a.*,b.date2,b.value2 from
data1 a
left join
data2 b
on a.date1 = intnx("months",b.date2,-1);
quit;
My Output:
date1 |value1 |date2 |value2
200101 |200 |200102 |4000000
200211 |400 |200212 |2000000
200212 |300 |200301 |2000000
Let me know in case of any queries.
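To illustrate the year rollover that worried the asker, here is a small sketch, assuming yrmon is stored as a plain yyyymm number as in the question. It is converted to a SAS date first, then advanced by one month with intnx; the variable names d and next are made up for illustration.

```sas
data _null_;
  yrmon = 198612;
  /* build a real SAS date: first day of the yyyymm month */
  d = mdy(mod(yrmon, 100), 1, int(yrmon / 100));
  /* advance by one calendar month; the year rolls over automatically */
  next = intnx('month', d, 1);
  put next= yymmn6.;
run;
```

The log line should read next=198701, which is why working with date values and intnx is safer than adding 1 to a yyyymm number.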
Is it possible to make a new statistic with proc summary that multiplies every value in each column, for example instead of just mean? SAS is so rigid it makes me crazy.
data test;
input b c ;
datalines;
50 11
35 12
75 13
;
Desired output would be 50*35*75 and 11*12*13, plus _FREQ_ (as in the normal output of proc summary).
This is an uncommon aggregate, so you essentially need to roll your own. Since a data step loops over the rows, this is easily accomplished by using RETAIN to carry the running products from row to row and outputting the result at the last record. The retained products must be initialised to 1, otherwise they would stay missing.
Data want;
Set test end=eof;
* start both running products at 1;
Retain prod_b prod_c 1;
prod_b = prod_b * b;
prod_c = prod_c * c;
Freq= _n_;
If eof then OUTPUT;
Keep prod: freq;
Run;
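For strictly positive values, a product can also be written with built-in aggregates as exp(sum(log(x))). The sketch below assumes the TEST dataset from the question; the output name want_sql is made up, and the results are subject to floating-point rounding, so they may need a round() before being compared with exact integers.

```sas
data test;
  input b c;
  datalines;
50 11
35 12
75 13
;
run;

proc sql;
  create table want_sql as
  select exp(sum(log(b))) as prod_b,  /* valid only when every b > 0 */
         exp(sum(log(c))) as prod_c,  /* valid only when every c > 0 */
         count(*)         as _freq_
  from test;
quit;
```

Zeros or negative values make log() return missing, so the data step with RETAIN is the more general choice.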
Here is the part of dataset:
Obs Buffer
...
75 14
76 13
77 64
78 38.1%
79 29.2%
80 69.2%
81 33
82 5-12
...
I only need the data containing "%" and the two rows ahead of this. For example, in this case I want to pull out "13" "64" "38.1%" "29.2%" and "69.2%".
Is there a way I can achieve this?
I like using point for this sort of thing. _N_ is reliable as a row counter as long as you're not doing anything funny with the data step loop.
data have;
length buffer $50;
input obs buffer $;
datalines;
75 14
76 13
77 64
78 38.1%
79 29.2%
80 69.2%
81 33
82 5-12
;;;;
run;
data want;
set have;
pointer=_N_;
if find(buffer,'%') then do;
output;
pointer=_N_-1;
set have point=pointer;
if not (find(buffer,'%')) then do;
output;
pointer=_N_-2;
set have point=pointer;
if not (find(buffer,'%')) then output;
end;
end;
run;
If you need to restore your order you can sort by pointer afterwards (or obs, if that is a real variable - I assume it is not). If obs is indeed a real variable (or if you make it real with a view), there is an interesting way in SQL to do this:
proc sql;
create table want as
select H.* from have H
left join have V on H.obs=V.obs-1
left join have A on H.obs=A.obs-2
where
(find(H.buffer,'%'))
or
(find(V.buffer,'%'))
or
(find(A.buffer,'%'))
order by H.obs
;
quit;
How to make obs real without a data pass:
data have_vw/view=have_vw;
set have;
obs=_n_;
run;
And then use have_vw instead of have in the SQL query (in all three spots).
To answer your question: the _N_ variable will return you the number of times that the data step has looped past the data statement. (http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000695104.htm)
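However, to solve your problem use lag() (http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212547.htm) and a test for '%' with find() (the ? contains operator is only available in WHERE expressions, not in an IF statement), e.g.
data justBuffers;
  set yourDS;
  * lag functions are called on every row so that their queues stay in step;
  twoBefore = lag2(Buffer);
  oneBefore = lag1(Buffer);
  if find(Buffer, '%') then do;
    * only pull the two preceding rows at the start of a run of % rows;
    if not find(oneBefore, '%') then do;
      if not missing(twoBefore) then do;
        x = twoBefore;
        output;
      end;
      if not missing(oneBefore) then do;
        x = oneBefore;
        output;
      end;
    end;
    x = Buffer;
    output;
  end;
  keep x;
run;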
I've not tested the code so watch out! I'm sure you could make it smoother.
By following kungfujam's thought, I get the code below and it works.
data adjust;
set source;
oneBefore = lag1(Buffer);
twoBefore = lag2(Buffer);
threeBefore = lag3(Buffer);
fourBefore = lag4(Buffer);
if index(buffer,'%')^=0 and index(onebefore,'%')^=0 and index(twobefore,'%')^=0 then do;
x = fourBefore;
output;
x = threeBefore;
output;
x = twoBefore;
output;
x = oneBefore;
output;
x=buffer;
output;
end;
keep x;
run;