SAS help - median of values in an array - sas

I am relatively new to SAS but have done a fair amount of programming over the years. I am at a loss on how to accomplish a task in SAS that I feel I would be able to do relatively easily in other platforms. I have an input table similar to this:
City         | _1988  | _1989  | _1990  | _1991  | _1992  | _1993  | _1994  | _1995  | _1996  | _1997  | _1998  | _1999  | _2000
Columbus     | 438866 | 437148 | 16082  | 475843 | 224411 | 411569 | 658459 | 174208 | 592418 | 31664  | 312374 | 242830 | 342950
Fargo        | 11218  | 7402   | 35574  | 14765  | 64727  | 29492  | 104541 | 616    | 57864  | 73451  | 96251  | 78803  | 34743
Santa Fe     | 10608  | 31531  | 46163  | 28215  | 62608  | 52576  | 55674  | 43339  | 34896  | 77851  | 41304  | 31308  | 60306
Poughkeepsie | 2184   | 15642  | 13505  | 9279   | 22796  | 6458   | 3279   | 4458   | 19672  | 17610  | 2672   | 11454  | 1072
Montpelier   | 1428   | 671    | 520    | 5453   | 5468   | 2117   | 2802   | 5847   | 3165   | 6204   | 1832   | 5357   | 5499
Waco         | 12527  | 695    | 44426  | 61651  | 83997  | 12811  | 50570  | 15022  | 86732  | 38541  | 45292  | 120719 | 17969
Nashville    | 359806 | 249811 | 422314 | 151319 | 466174 | 107335 | 315576 | 571273 | 195685 | 230626 | 194663 | 11060  | 545940
Billings     | 49694  | 37415  | 38602  | 79238  | 65260  | 18497  | 8976   | 81148  | 71326  | 108760 | 43740  | 48110  | 32106
Pensacola    | 4501   | 9682   | 19061  | 14731  | 4623   | 16106  | 13419  | 47607  | 9198   | 25003  | 39303  | 45146  | 24143
Trenton      | 40341  | 21210  | 4162   | 57773  | 16937  | 60495  | 21508  | 80819  | 27349  | 65088  | 65815  | 66308  | 38151
I would like to find the median of all the differences in values for each city.
The basic logic is I need to obtain the median of all the values in the array "difference" in the pseudo-code below.
for i = 1988 to 2000
    for j = i+1 to 2000
        difference(i,j) = value year_i - value year_j
    end
end
I wish I could paste my sample code here, but I am at a point of writer's block where what I have produced is so far off that it is of no use. I don't necessarily need someone to write the entire code for me, but I am hoping somebody can point me down the right path. I feel like this shouldn't be that hard, but I am at a loss...
Thanks in advance!

I'm not entirely following what you're trying to do here, but I think this will get you going.
First, I create the temp data set with the pairwise differences you asked for. Then I use PROC SUMMARY to calculate the median by city and year.
Feel free to ask.
/* sample data */
data have;
infile datalines dlm = '|'; /* INFILE must run before INPUT so the delimiter applies */
input City :$12. _1988 - _2000;
datalines;
Columbus |438866|437148|16082 |475843|224411|411569|658459|174208|592418|31664 |312374|242830|342950
Fargo |11218 |7402 |35574 |14765 |64727 |29492 |104541|616 |57864 |73451 |96251 |78803 |34743
Santa Fe |10608 |31531 |46163 |28215 |62608 |52576 |55674 |43339 |34896 |77851 |41304 |31308 |60306
Poughkeepsie|2184 |15642 |13505 |9279 |22796 |6458 |3279 |4458 |19672 |17610 |2672 |11454 |1072
Montpelier |1428 |671 |520 |5453 |5468 |2117 |2802 |5847 |3165 |6204 |1832 |5357 |5499
Waco |12527 |695 |44426 |61651 |83997 |12811 |50570 |15022 |86732 |38541 |45292 |120719|17969
Nashville |359806|249811|422314|151319|466174|107335|315576|571273|195685|230626|194663|11060 |545940
Billings |49694 |37415 |38602 |79238 |65260 |18497 |8976 |81148 |71326 |108760|43740 |48110 |32106
Pensacola |4501 |9682 |19061 |14731 |4623 |16106 |13419 |47607 |9198 |25003 |39303 |45146 |24143
Trenton |40341 |21210 |4162 |57773 |16937 |60495 |21508 |80819 |27349 |65088 |65815 |66308 |38151
;
/* pairwise differences in long format */
data temp;
set have;
array y1(i) _1988 - _2000;
array y2(j) _1988 - _2000;
do over y1;
do over y2;
year1 = input(compress(vname(y1), , 'kd'), 8.);
val1 = y1;
year2 = input(compress(vname(y2), , 'kd'), 8.);
val2 = y2;
diff = val1 - val2;
if i ne j then output;
end;
end;
drop _:;
run;
/* calculate median */
proc summary data = temp nway;
class city year1;
var diff;
output out = want(drop = _:) median =;
run;

You can use a 13-element array to hold the year values and a 13x13 temporary array to hold the differences, then call median(of ARRAY(*)) to get the median. Cells that are never filled stay missing, and MEDIAN ignores missing values.
* create the sample data;
data have;
input City :$12. _1988 _1989 _1990 _1991 _1992 _1993 _1994 _1995 _1996 _1997 _1998 _1999 _2000;
datalines;
Columbus 438866 437148 16082 475843 224411 411569 658459 174208 592418 31664 312374 242830 342950
Fargo 11218 7402 35574 14765 64727 29492 104541 616 57864 73451 96251 78803 34743
Santa-Fe 10608 31531 46163 28215 62608 52576 55674 43339 34896 77851 41304 31308 60306
Poughkeepsie 2184 15642 13505 9279 22796 6458 3279 4458 19672 17610 2672 11454 1072
Montpelier 1428 671 520 5453 5468 2117 2802 5847 3165 6204 1832 5357 5499
Waco 12527 695 44426 61651 83997 12811 50570 15022 86732 38541 45292 120719 17969
Nashville 359806 249811 422314 151319 466174 107335 315576 571273 195685 230626 194663 11060 545940
Billings 49694 37415 38602 79238 65260 18497 8976 81148 71326 108760 43740 48110 32106
Pensacola 4501 9682 19061 14731 4623 16106 13419 47607 9198 25003 39303 45146 24143
Trenton 40341 21210 4162 57773 16937 60495 21508 80819 27349 65088 65815 66308 38151
;
run;
data want;
set have;
* create an array to hold the year variables;
array years {1988:2000} _1988 - _2000;
* create a 2-dimensional array to hold the differences;
array differences {1988:2000,1988:2000} _temporary_;
do i = 1988 to 2000;
do j = i + 1 to 2000;
* calculate the differences as per pseudo-code in question;
differences(i,j) = years(i) - years(j);
end;
end;
* get median value;
median_diff = median(of differences(*));
run;
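For readers who want to check the logic outside SAS, here is a minimal Python sketch of the question's pseudo-code. It assumes one city's yearly values arrive as a plain list in year order; the function name and the three-value example row are made up for illustration and are not taken from the table above. With 13 years there are 13*12/2 = 78 pairs per city.

```python
from itertools import combinations
from statistics import median


def median_pairwise_difference(values):
    """Median of values[i] - values[j] over all pairs with i < j."""
    # combinations() yields each pair once, earlier element first,
    # which matches the i < j loops in the question's pseudo-code.
    diffs = [earlier - later for earlier, later in combinations(values, 2)]
    return median(diffs)


# Hypothetical three-year row, not taken from the question's table.
example_row = [1, 4, 9]
print(median_pairwise_difference(example_row))
```

This is only a cross-check on what the data step should produce for a given row; it does not replace the SAS solutions above.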

Related

How to make my output to read as yes or no?

I want to display my data as either yes or no in the output for initial testing, site visit, and follow-up. How would I do that? The data set holds numeric values for these variables, but I want character responses of "y" or "n".
PROC FORMAT;
VALUE SiteVisitfmt 1 = 'yes'
0 = 'no';
VALUE InitialTestingfmt 1 = 'yes'
2 = 'no';
VALUE TestEventfmt 1 = 'One Event '
2 = 'Two Events'
3 = 'Three Events'
4 = 'Four Events'
5 = 'Five Events';
VALUE FollowUpfmt 1 = 'yes'
0 = 'no';
FORMAT SiteVisit SiteVisitfmt. InitialTesting InitialTestingfmt. TestEvent TestEventfmt.
FollowUp FollowUpfmt.;
RUN;
data PMdataedits;
set PMdata (rename = (Number_of_Days_from_Onset_to_Sit =SiteVisit
Number_of_Days_between_Onset_and = InitialTesting
Number_of_Test_Events_in_IRIS = TestEvent
Number_of_Days_between_Test_1_an = FollowUp));
drop SPA;
attrib date1 format=date9.;
date1=input(date,mmddyy10.);
NewSiteVisit = put(SiteVisit, 8.);
NewInitialTesting = put(InitialTesting, 8.);
NewFollowUp = put(FollowUp, 8.);
NewSiteVisit=;
if (NewSiteVisit=<1) THEN NewSiteVisit= '1';
if (NewSiteVisit>1) THEN NewSiteVist= '0';
NewInitialTesting=;
if (NewInitialTesting<=2) THEN NewInitialTesting= '1';
if (NewInitialTesting>2) THEN NewInitialTesting='0';
This statement:
FORMAT SiteVisit SiteVisitfmt. InitialTesting InitialTestingfmt. TestEvent TestEventfmt.
FollowUp FollowUpfmt.;
Needs to go in the data step (somewhere after data PMdataedits; but before the run; that you don't show), not in PROC FORMAT. That's the statement that assigns a format to a variable; each data set (which is defined by a data step) has its own set of variables, which can share names with variables in other data sets but have different contents and formats.
Also note that you don't have to name the formats after the variables, and don't need three different yes/no formats. You could have done:
proc format;
value ynf
1='yes'
0='no'
;
run;
And then used
format sitevisit initialtesting followup ynf.;
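That covers all three with one format, provided each variable is coded 1/0 (your InitialTestingfmt maps 2 to 'no', so InitialTesting would need recoding or its own format). What you did is legal; it's just more typing than you need!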

How to format dates for use in SAS?

I am trying to adapt Method 4 in this paper to calculate the duration of many observations, but discounting overlapping dates: https://support.sas.com/resources/papers/proceedings/proceedings/sugi31/048-31.pdf
For example, two rows of observations for subject 101 lasting from 2017-03-02 to 2017-03-16 and 2017-03-04 to 2017-03-17 respectively should return a value of only 16 days.
I am getting an error with the dates being 'Invalid numeric data', though, resulting in later errors. I have tried format startdate yyyymmdd10.; and format stopdate yyyymmdd10.; with no success.
Can anyone help me properly format my dates for use here, or identify any further errors?
Edit: Line 80 refers to do xdate = startdate to stopdate;.
I am still unable to convert or create the date variables as numeric/date values. I have used the following code:
data sasuser.Mdm;
set sasuser.Mdm;
do xdate = input(Startdate,yymmdd10.) to input(stopdate,yymmdd10.);
put xdate= yymmdd10.;
output;
end;
run;
To get this output:
1 data sasuser.Mdm;
2 set sasuser.Mdm;
3 do xdate = input(Startdate,yymmdd10.) to input(stopdate,yymmdd10.);
4 put xdate= yymmdd10.;
5 output;
6 end;
7 run;
xdate=2017-03-02
xdate=2017-03-03
xdate=2017-03-04
xdate=2017-03-05
xdate=2017-03-06
xdate=2017-03-07
xdate=2017-03-08
xdate=2017-03-09
xdate=2017-03-10
xdate=2017-03-11
xdate=2017-03-12
xdate=2017-03-13
xdate=2017-03-14
xdate=2017-03-15
xdate=2017-03-16
xdate=2017-03-04
xdate=2017-03-05
xdate=2017-03-06
xdate=2017-03-07
xdate=2017-03-08
xdate=2017-03-09
xdate=2017-03-10
xdate=2017-03-11
xdate=2017-03-12
xdate=2017-03-13
xdate=2017-03-14
xdate=2017-03-15
xdate=2017-03-16
xdate=2017-03-17
xdate=2017-03-07
xdate=2017-03-08
xdate=2017-03-09
xdate=2017-03-10
xdate=2017-03-11
xdate=2017-03-12
xdate=2017-03-13
xdate=2017-03-14
xdate=2017-03-15
xdate=2017-03-16
xdate=2017-03-17
xdate=2017-03-18
xdate=2017-03-19
xdate=2017-03-20
xdate=2017-03-21
xdate=2017-02-08
xdate=2017-02-09
xdate=2017-02-10
xdate=2017-02-11
xdate=2017-02-12
xdate=2017-02-13
xdate=2017-02-14
xdate=2017-02-15
xdate=2017-02-16
xdate=2017-02-17
xdate=2017-02-18
xdate=2017-02-19
xdate=2017-02-20
xdate=2017-02-21
xdate=2017-02-22
xdate=2017-02-23
xdate=2017-02-24
xdate=2017-02-23
xdate=2017-02-24
xdate=2017-02-25
xdate=2017-02-26
xdate=2017-02-27
xdate=2017-02-28
xdate=2017-03-01
xdate=2017-03-02
xdate=2017-03-03
xdate=2017-03-04
xdate=2017-03-05
xdate=2017-03-06
xdate=2017-03-07
xdate=2017-03-08
xdate=2017-02-26
xdate=2017-02-28
xdate=2017-03-13
xdate=2017-03-17
xdate=2017-03-25
xdate=2017-03-28
xdate=2017-03-23
xdate=2017-03-24
xdate=2017-03-25
xdate=2017-03-26
xdate=2017-03-27
xdate=2017-03-28
xdate=2017-03-29
xdate=2017-03-30
xdate=2017-03-29
xdate=2017-04-03
xdate=2017-04-04
xdate=2017-04-03
xdate=2017-04-04
xdate=2017-04-05
xdate=2017-04-05
xdate=2017-04-06
xdate=2017-04-06
xdate=2017-04-07
xdate=2017-03-25
xdate=2017-03-26
xdate=2017-03-30
xdate=2017-04-01
xdate=2017-04-02
xdate=2017-04-03
xdate=2017-04-04
xdate=2017-04-08
xdate=2017-04-09
xdate=2017-04-10
xdate=2017-04-11
xdate=2017-04-12
xdate=2017-04-12
xdate=2017-04-13
xdate=2017-04-13
xdate=2017-04-14
xdate=2017-04-15
xdate=2017-04-16
xdate=2017-04-17
xdate=2017-04-18
xdate=2017-04-19
xdate=2017-04-20
xdate=2017-04-21
xdate=2017-04-22
xdate=2017-04-19
xdate=2017-04-23
xdate=2017-04-24
xdate=2017-04-25
xdate=2017-04-26
xdate=2017-04-26
xdate=2017-04-27
xdate=2017-04-28
xdate=2017-05-05
xdate=2017-05-06
xdate=2017-05-16
xdate=2017-05-19
xdate=2017-05-20
xdate=2017-05-21
xdate=2017-05-22
xdate=2017-05-19
xdate=2017-05-20
xdate=2017-05-21
xdate=2017-05-22
xdate=2017-05-23
xdate=2017-05-24
xdate=2017-05-25
xdate=2017-05-26
xdate=2017-05-22
xdate=2017-05-23
xdate=2017-05-24
xdate=2017-05-25
xdate=2017-05-26
xdate=2017-05-27
xdate=2017-05-28
xdate=2017-05-29
xdate=2017-05-30
xdate=2017-05-31
xdate=2017-06-01
xdate=2017-06-02
xdate=2017-06-03
xdate=2017-06-04
xdate=2017-06-05
xdate=2017-06-06
xdate=2017-06-07
xdate=2017-06-08
xdate=2017-06-09
xdate=2017-06-10
xdate=2017-06-11
xdate=2017-06-12
xdate=2017-06-13
xdate=2017-06-14
xdate=2017-06-15
xdate=2017-06-16
xdate=2017-06-17
xdate=2017-06-18
xdate=2017-06-19
xdate=2017-06-20
xdate=2017-06-21
xdate=2017-06-22
xdate=2017-06-23
xdate=2017-06-24
xdate=2017-06-25
xdate=2017-06-26
xdate=2017-06-27
xdate=2017-06-28
xdate=2017-06-29
xdate=2017-06-30
xdate=2017-07-01
xdate=2017-07-02
xdate=2017-07-03
xdate=2017-07-04
xdate=2017-07-05
xdate=2017-07-06
xdate=2017-07-07
xdate=2017-07-08
xdate=2017-07-09
xdate=2017-07-10
xdate=2017-07-11
xdate=2017-07-12
xdate=2017-07-13
xdate=2017-07-14
xdate=2017-07-15
xdate=2017-07-16
xdate=2017-07-17
xdate=2017-07-18
xdate=2017-07-19
xdate=2017-07-20
xdate=2017-07-21
xdate=2017-07-22
xdate=2017-07-23
xdate=2017-07-24
xdate=2017-07-25
xdate=2017-07-26
xdate=2017-07-27
xdate=2017-07-28
xdate=2017-07-29
xdate=2017-07-30
xdate=2017-07-31
xdate=2017-08-01
xdate=2017-08-02
xdate=2017-08-03
xdate=2017-08-04
xdate=2017-08-05
xdate=2017-08-06
xdate=2017-08-07
xdate=2017-08-08
xdate=2017-08-09
xdate=2017-08-10
xdate=2017-08-11
xdate=2017-08-12
xdate=2017-08-13
xdate=2017-08-14
xdate=2017-08-15
xdate=2017-08-16
xdate=2017-08-17
xdate=2017-08-18
xdate=2017-08-19
xdate=2017-08-20
xdate=2017-08-21
xdate=2017-08-22
xdate=2017-08-23
xdate=2017-08-24
xdate=2017-08-25
xdate=2017-08-26
xdate=2017-08-27
xdate=2017-08-28
xdate=2017-08-29
xdate=2017-08-30
xdate=2017-08-31
xdate=2017-09-01
xdate=2017-05-27
xdate=2017-05-28
xdate=2017-05-29
xdate=2017-05-30
xdate=2017-05-31
xdate=2017-06-01
xdate=2017-06-02
xdate=2017-06-03
xdate=2017-06-04
xdate=2017-06-05
xdate=2017-06-06
xdate=2017-06-07
xdate=2017-06-08
xdate=2017-06-09
xdate=2017-06-10
xdate=2017-06-11
xdate=2017-06-12
xdate=2017-06-13
xdate=2017-06-14
xdate=2017-06-15
xdate=2017-06-16
xdate=2017-06-17
xdate=2017-06-18
xdate=2017-06-19
xdate=2017-06-20
xdate=2017-06-21
xdate=2017-06-22
xdate=2017-06-23
xdate=2017-06-24
xdate=2017-06-25
xdate=2017-06-26
xdate=2017-06-27
xdate=2017-06-28
xdate=2017-06-29
xdate=2017-06-30
xdate=2017-07-01
xdate=2017-07-02
xdate=2017-07-03
xdate=2017-07-04
xdate=2017-07-05
xdate=2017-07-06
xdate=2017-07-07
xdate=2017-07-08
xdate=2017-07-09
xdate=2017-07-10
xdate=2017-07-11
xdate=2017-07-12
xdate=2017-07-13
xdate=2017-07-14
xdate=2017-07-15
xdate=2017-07-16
xdate=2017-07-17
xdate=2017-07-18
xdate=2017-07-19
xdate=2017-07-20
xdate=2017-07-21
xdate=2017-07-22
xdate=2017-07-23
xdate=2017-07-24
xdate=2017-07-25
xdate=2017-07-26
xdate=2017-07-27
xdate=2017-07-28
xdate=2017-07-29
xdate=2017-07-30
xdate=2017-07-31
xdate=2017-08-01
xdate=2017-08-02
xdate=2017-08-03
xdate=2017-08-04
xdate=2017-08-05
xdate=2017-08-06
xdate=2017-08-07
xdate=2017-08-08
xdate=2017-08-09
xdate=2017-08-10
xdate=2017-08-11
xdate=2017-08-12
xdate=2017-08-13
xdate=2017-08-14
xdate=2017-08-15
xdate=2017-08-16
xdate=2017-08-17
xdate=2017-08-18
xdate=2017-08-19
xdate=2017-08-20
xdate=2017-08-21
xdate=2017-08-22
xdate=2017-08-23
xdate=2017-08-24
xdate=2017-08-25
xdate=2017-08-26
xdate=2017-08-27
xdate=2017-08-28
xdate=2017-08-29
xdate=2017-08-30
xdate=2017-08-31
xdate=2017-09-01
xdate=2017-06-14
xdate=2017-06-15
xdate=2017-06-16
xdate=2017-06-17
xdate=2017-06-18
xdate=2017-06-19
xdate=2017-06-20
xdate=2017-06-21
xdate=2017-06-22
xdate=2017-06-23
xdate=2017-06-24
xdate=2017-06-25
xdate=2017-06-26
xdate=2017-06-27
xdate=2017-06-28
xdate=2017-06-29
xdate=2017-06-14
xdate=2017-06-15
xdate=2017-06-16
xdate=2017-06-17
xdate=2017-06-18
xdate=2017-06-19
xdate=2017-06-20
xdate=2017-06-21
xdate=2017-06-22
xdate=2017-06-23
xdate=2017-06-24
xdate=2017-06-25
xdate=2017-06-26
xdate=2017-06-27
xdate=2017-06-28
xdate=2017-06-29
xdate=2017-03-27
xdate=2017-04-02
xdate=2017-04-07
xdate=2017-04-08
xdate=2017-04-09
xdate=2017-04-13
xdate=2017-04-14
xdate=2017-04-15
xdate=2017-04-16
xdate=2017-04-17
xdate=2017-04-19
xdate=2017-04-20
xdate=2017-04-21
xdate=2017-04-22
xdate=2017-04-23
xdate=2017-04-24
xdate=2017-04-20
xdate=2017-04-21
xdate=2017-04-22
xdate=2017-04-23
xdate=2017-04-24
xdate=2017-04-25
xdate=2017-04-26
xdate=2017-04-27
xdate=2017-04-28
xdate=2017-04-29
xdate=2017-04-30
xdate=2017-05-01
xdate=2017-05-02
xdate=2017-04-24
xdate=2017-04-25
xdate=2017-04-26
xdate=2017-04-27
xdate=2017-04-28
xdate=2017-04-29
xdate=2017-04-30
xdate=2017-05-01
xdate=2017-05-02
xdate=2017-05-03
xdate=2017-05-04
xdate=2017-05-05
xdate=2017-05-06
xdate=2017-05-07
xdate=2017-05-08
xdate=2017-05-09
xdate=2017-05-10
xdate=2017-05-11
xdate=2017-05-12
xdate=2017-05-13
xdate=2017-05-14
xdate=2017-05-15
xdate=2017-05-16
ERROR: Invalid DO loop control information, either the INITIAL or TO expression is missing or the
BY expression is missing, zero, or invalid.
SUBJID=106 KEY=106-9 OBS=9 TOTAL=12 STARTDATE=2017-04-25 STOPDATE= CLASS=Steroid / Diuretic
xdate=20934 _ERROR_=1 _N_=52
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 52 observations read from the data set SASUSER.MDM.
WARNING: The data set SASUSER.MDM may be incomplete. When this step was stopped there were 431
observations and 8 variables.
WARNING: Data set SASUSER.MDM was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.38 seconds
cpu time 0.29 seconds
I don't understand why input doesn't appear to be working. Dates are still listed as character strings under column attributes. The do part also isn't working as intended. I'd be grateful for any further guidance.
1. Do not use the same name in the DATA and SET statements. Otherwise you always have to rebuild from the start.
2. Convert your start and stop dates to SAS dates.
3. Remove the PUT statement.
4. Add formats to see the dates displayed as desired.
5. Drop the old variables to avoid confusion.
Your two code steps, the data step and the SQL, do not appear related. I'm not sure why you would even need a list of dates for the intervals. There are much better ways to calculate an overlap. I think this is an XY problem: it would be significantly easier if you showed us what you're ultimately trying to do, and people could then provide a much better solution.
data sasuser.Mdm2; /*1*/
set sasuser.Mdm;
/*2*/
start_date = input(startdate, yymmdd10.);
end_date = input(stopdate, yymmdd10.);
do xdate = start_date to end_date;
output; /*3*/
end;
/*4*/
format start_date end_date xDate yymmdd10.;
/*5*/
drop startdate stopdate;
run;
*check;
proc contents data=sasuser.mdm2;
run;
EDIT: Also, if you had some sort of grouping variable to indicate that these were part of the same episode you could then just take the min/max of the dates and subtract them to get the interval duration for starters. Grouping via a data step is trivial.
data want;
set have;
by id;
retain episode;
start_date = input(startdate, yymmdd10.);
end_date = input(stopdate, yymmdd10.);
prev_stop_date = lag(end_date);
if first.id then do;
episode = 0;
call missing(prev_stop_date);
end;
if not (start_date <=prev_stop_date <= end_date) then episode+1;
*could add in logic to calculate dates and durations as well depending....;
run;
It sounds like your SAS log is complaining about this statement.
do xdate=startdate to stopdate;
Because STARTDATE and STOPDATE are character strings instead of dates.
Make sure to create your date values as dates instead of character strings.
Tom's correct, of course, the startdate and stopdate seem to be characters.
To properly use this, do something like this (only the do loop is relevant for you, the rest is to show it working):
data _null_;
startdate = '2017-03-02';
stopdate = '2017-03-16';
do xdate = input(Startdate,yymmdd10.) to input(stopdate,yymmdd10.);
put xdate= yymmdd10.; *just put to the log to see what you are getting;
end;
run;
input will convert the text to a numeric value. Do realize you have to format that xdate as a date format if you want to be able to view it - if you're just using it as an input, though, you can leave the formatting off.
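Two things from the thread can be checked with a small sketch. First, the log in the question shows the step stopping on a row where STOPDATE is blank (SUBJID=106), so the TO expression of the DO loop is missing; rows like that need to be skipped or handled before looping. Second, the overlap-free duration the question is after can be computed by merging intervals instead of listing every date. The Python below is a hedged illustration only: the function name, the tuple layout, and the choice to skip incomplete rows are assumptions, not part of the paper's Method 4.

```python
from datetime import date, timedelta


def covered_days(intervals):
    """Count distinct calendar days covered by inclusive (start, stop) pairs.

    Pairs with a missing (None) start or stop are skipped.
    """
    complete = sorted(
        (start, stop)
        for start, stop in intervals
        if start is not None and stop is not None
    )
    total = 0
    current_start = current_stop = None
    for start, stop in complete:
        if current_stop is None or start > current_stop + timedelta(days=1):
            # Gap before this interval: close the previous merged block.
            if current_stop is not None:
                total += (current_stop - current_start).days + 1
            current_start, current_stop = start, stop
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_stop = max(current_stop, stop)
    if current_stop is not None:
        total += (current_stop - current_start).days + 1
    return total


subject_101 = [
    (date(2017, 3, 2), date(2017, 3, 16)),
    (date(2017, 3, 4), date(2017, 3, 17)),
    (date(2017, 4, 25), None),  # hypothetical row with a blank stop date
]
print(covered_days(subject_101))
```

For the two complete rows this gives the 16 days the question expects (2017-03-02 through 2017-03-17, inclusive).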

Calculating value weighted returns in SAS

I have some data in the following format:
COMPNAME DATE CAP RETURN
I have found some code that will construct and calculate the value-weighted return based on the data.
This works great and is below:
PROC SUMMARY NWAY DATA = Data1 ; CLASS DATE ;
VAR RETURN / WEIGHT = CAP ;
OUTPUT
OUT = MKTRET
MEAN (RETURN) = MONTHLYRETURN ;
RUN;
The extension that I would like to make is in my head a little bit complicated.
I want to make the weights based on the market capitalization in June.
So this will be a buy-and-hold portfolio. The actual data has hundreds of companies, but here is a representative example with two companies, purely to explain how the weights will evolve...
Say for example I have two companies, A and B.
The CAP of A is £100m and B is £100m.
In July of one year, I would invest 50% in A and 50% in B.
The returns in July are 10% and -10%.
Therefore I would invest 55% and 45%.
It will go on like this until next June when I will re-balance again based on the market capitalisation...
10% monthly return is pretty speculative!
When the two companies differ by more than 200 you will need to also sell and buy to equalize the companies.
Presume the rates per month are simulated and stored in a data set. You can generate a simulated ledger as follows:
- add returns
- compare balances
- equalize by splitting the 200 investment if the balances are close enough
- otherwise equalize by investing all 200 in one company and selling and buying
Of course, a portfolio with more than 2 companies becomes a more complicated balancing act to achieve mathematical balance.
data simurate(label="Future expectation is not an indicator of past performance :)");
do month = 1 to 60;
do company = 1 to 2;
return = round (sin(company+month/4) / 12, 0.001); %* random return rate for month;
output;
end;
end;
run;
data want;
if 0 then set simurate;
declare hash lookup (dataset:'simurate');
lookup.defineKey ('company', 'month');
lookup.defineData('return');
lookup.defineDone();
month = 0;
bal1 = 0; bal2 = 0;
output;
do month = 1 to 60;
lookup.find(key:1, key:month); rate1 = return;
ret1 = round(bal1 * rate1, 0.0001);
lookup.find(key:2, key:month); rate2 = return;
ret2 = round(bal2 * rate2, 0.0001);
bal1 + ret1;
bal2 + ret2;
goal = mean(bal1,bal2) + 100;
sel1 = 0; buy1 = 0;
sel2 = 0; buy2 = 0;
if abs(bal1-bal2) <= 200 then do;
* difference between balances after returns is <= 200;
* balances can be equalized by a simple investment split;
inv1 = goal - bal1;
inv2 = goal - bal2;
end;
else if bal1 < bal2 then do;
* sell bal2 as needed to equalize;
inv1 = 200;
inv2 = 0;
buy1 = goal - 200 - bal1;
sel2 = bal2 - goal;
end;
else do;
inv2 = 200;
inv1 = 0;
buy2 = goal - 200 - bal2;
sel1 = bal1 - goal;
end;
bal1 + (buy1 - sel1 + inv1);
bal2 + (buy2 - sel2 + inv2);
output;
end;
stop;
drop company return ;
format bal: 10.4 rate: 5.3;
run;
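The weight drift described in the question (50/50 in July becoming 55/45 after returns of 10% and -10%) is a separate piece of arithmetic from the ledger above, and it may help to see it on its own. The following Python sketch assumes weights are fractions that sum to 1 and that nothing is bought or sold between rebalancing dates; the function names are invented for the example.

```python
def portfolio_return(weights, returns):
    """Value-weighted return for one period."""
    return sum(w * r for w, r in zip(weights, returns))


def drift_weights(weights, returns):
    """Buy-and-hold weights after one period with no rebalancing."""
    grown = [w * (1.0 + r) for w, r in zip(weights, returns)]
    total = sum(grown)
    return [g / total for g in grown]


# June market caps of 100m each give equal starting weights.
weights = [0.5, 0.5]
july_returns = [0.10, -0.10]

print(portfolio_return(weights, july_returns))
weights = drift_weights(weights, july_returns)  # weights to use for August
print(weights)
```

Repeating the two calls month by month, and resetting the weights from market capitalization each June, gives the buy-and-hold series the question describes.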

how to calculate weighted average but exclude the object itself using SAS

There are four variables in my dataset. Company shows the company's name. Return is the return of Company at day Date. Weight is the weight of this company in the market.
I want to keep all variables in the original file and create an additional variable holding the market return that excludes the company itself. The market return for stock a is the weighted sum of the returns of all other stocks in the market on the same Date. For example, if there are 3 stocks in the market, a, b and c, the market return for stock a is Return(b) * [Weight(b) / (Weight(b) + Weight(c))] + Return(c) * [Weight(c) / (Weight(b) + Weight(c))]. Similarly, the market return for stock b is Return(a) * [Weight(a) / (Weight(a) + Weight(c))] + Return(c) * [Weight(c) / (Weight(a) + Weight(c))].
I tried to use proc summary, but it cannot exclude stock a when calculating the market return for stock a.
PROC SUMMARY NWAY DATA ;
CLASS Date ;
VAR Return / WEIGHT = weight;
OUTPUT
OUT = output
MEAN (Return) = MarketReturn;
RUN;
Could anyone show me how to solve this, please? I am relatively new to this software, so I don't know whether I should use a loop or whether there is a better alternative.
This can be done with a bit of fancy algebra. It's not something that's built-in, though.
Basically:
Construct a "total" market return
Construct a stock by stock return (so just return of A)
Subtract out the portion that A contributes to total.
Thanks to the simple math that generates these lists, it's quite easy to do this.
Total mean = ((mean of A * sum of A wgts) + (mean of rest * sum of rest wgts)) / (sum of A wgts + sum of rest wgts)
So, solve that for the mean of the rest.
Exclusive mean = ((mean of all * sum of all wgts) - (mean of A * sum of A wgts)) / (sum of all wgts - sum of A wgts)
Something like this.
data returns;
input stock $ return weight;
datalines;
A .50 1
B .75 2
C .33 1
;;;;
run;
proc means data=returns;
class stock;
types () stock; *this is the default;
weight weight;
output out=means_out mean= sumwgt= /autoname;
run;
data returns_excl;
if _n_=1 then set means_out(where=(_type_=0) rename=(return_mean=tot_return return_sumwgt=tot_wgts));
set means_out(where=(_type_=1));
return_excl = (tot_return*tot_wgts-return_mean*return_sumwgt)/(tot_wgts-return_sumwgt);
run;
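The algebra can be checked by hand on the three sample stocks. This Python sketch applies the same "total minus own contribution" formula to one date's worth of data; it assumes at least two stocks with positive weights (otherwise the denominator is zero), and the function name is made up for the example.

```python
def leave_one_out_weighted_means(returns, weights):
    """Weighted mean of everyone else's return, for each stock in turn."""
    total_weight = sum(weights)
    total_weighted = sum(r * w for r, w in zip(returns, weights))
    # Remove each stock's own contribution from both numerator and denominator.
    return [
        (total_weighted - r * w) / (total_weight - w)
        for r, w in zip(returns, weights)
    ]


# Sample data from the answer: stocks A, B, C.
returns = [0.50, 0.75, 0.33]
weights = [1, 2, 1]
for stock, value in zip("ABC", leave_one_out_weighted_means(returns, weights)):
    print(stock, round(value, 4))
```

For stock A, for instance, this is (0.75*2 + 0.33*1) / (2 + 1) = 0.61, the same value the return_excl expression computes. With a Date variable in the real data, the calculation would be repeated within each date.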

Stata: Subsetting data using criteria stored in other data set

I have a large data set. I have to subset the data set (Big_data) by using values stored in other dta file (Criteria_data). I will show you the problem first:
**Big_data** **Criteria_data**
==================== ================================================
lon lat 4_digit_id minlon maxlon minlat maxlat
-76.22 44.27 0765 -78.44 -77.22 34.324 35.011
-67.55 33.19 6161 -66.11 -65.93 40.32 41.88
....... ........
(over 1 million obs) (271 observations)
==================== ================================================
I have to subset the big data as follows:
use Big_data
preserve
keep if (-78.44<lon<-77.22) & (34.324<lat<35.011)
save data_0765, replace
restore
preserve
keep if (-66.11<lon<-65.93) & (40.32<lat<41.88)
save data_6161, replace
restore
....
(1) What should be the efficient programming for the subsetting in Stata? (2) Are the inequality expressions correctly written?
1) Subsetting data
With 400,000 observations in the main file and 300 in the reference file, it takes about 1.5 minutes. I can't test this with double the observations in the main file because the lack of RAM slows my computer to a crawl.
The strategy involves creating as many variables as needed to hold the reference latitudes and longitudes (271*4 = 1084 in the OP's case; Stata IC and up can handle this. See help limits). This requires some reshaping and appending. Then we check for those observations of the big data file that meet the conditions.
clear all
set more off
*----- create example databases -----
tempfile bigdata reference
input ///
lon lat
-76.22 44.27
-66.0 40.85 // meets conditions
-77.10 34.8 // meets conditions
-66.00 42.0
end
expand 100000
save "`bigdata'"
*list
clear all
input ///
str4 id minlon maxlon minlat maxlat
"0765" -78.44 -75.22 34.324 35.011
"6161" -66.11 -65.93 40.32 41.88
end
drop id
expand 150
gen id = _n
save "`reference'"
*list
*----- reshape original reference file -----
use "`reference'", clear
tempfile reference2
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
gen lat = .
gen lon = .
save "`reference2'"
*----- create working database -----
use "`bigdata'"
timer on 1
quietly {
forvalues num = 1/300 {
gen minlon`num' = .
gen maxlon`num' = .
gen minlat`num' = .
gen maxlat`num' = .
}
}
timer off 1
timer on 2
append using "`reference2'"
drop i
timer off 2
*----- flag observations for which conditions are met -----
timer on 3
gen byte flag = 0
foreach le of local lev {
quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3
*keep if flag
*keep lon lat
*list
timer list
The inrange() function implies that the minimums and maximums must be adjusted beforehand to satisfy the OP's strict inequalities (the function tests <=, >=).
Probably some expansion using expand, use of correlatives and by (so data is in long form) could speed things up. It's not totally clear for me right now. I'm sure there are better ways in plain Stata mode. Mata may be even better.
(joinby was also tested but again RAM was a problem.)
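To make the inclusive-versus-strict point concrete, here is a small Python sketch of the intended subsetting rule, using the four example points and two reference boxes from the code above. It is an illustration of the logic only, not a performance suggestion; the function names and the dictionary layout are assumptions made for the example.

```python
def strictly_inside(lon, lat, box):
    """True when the point lies strictly inside the box (edges excluded)."""
    return (box["minlon"] < lon < box["maxlon"]
            and box["minlat"] < lat < box["maxlat"])


def subset_by_box(points, boxes):
    """Return {box id: [points strictly inside that box]}."""
    return {
        box["id"]: [p for p in points if strictly_inside(p[0], p[1], box)]
        for box in boxes
    }


points = [(-76.22, 44.27), (-66.0, 40.85), (-77.10, 34.8), (-66.00, 42.0)]
boxes = [
    {"id": "0765", "minlon": -78.44, "maxlon": -75.22, "minlat": 34.324, "maxlat": 35.011},
    {"id": "6161", "minlon": -66.11, "maxlon": -65.93, "minlat": 40.32, "maxlat": 41.88},
]
print(subset_by_box(points, boxes))
```

A point sitting exactly on a box edge is excluded here, whereas inrange() in Stata would include it.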
Edit
Doing the computations in chunks rather than on the complete database significantly eases the RAM issue. Using a main file with 1.2 million observations and a reference file with 300 observations, the following code does all the work in about 1.5 minutes:
set more off
*----- create example big data -----
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
*----- reshape original reference file -----
use "`reference'", clear
destring id, replace
levelsof id, local(lev)
gen i = 1
reshape wide minlon maxlon minlat maxlat, i(i) j(id)
drop i
tempfile reference2
save "`reference2'"
*----- create file to save results -----
tempfile results
clear all
set obs 0
gen lon = .
gen lat = .
save "`results'"
*----- start computations -----
clear all
* local that controls # of observations in intermediate files
local step = 5000 // can't be larger than sizebd
timer clear
timer on 99
forvalues en = `step'(`step')`sizebd' {
* load observations and join with references
timer on 1
local start = `en' - (`step' - 1)
use in `start'/`en' using "`bigdata'", clear
timer off 1
timer on 2
append using "`reference2'"
timer off 2
* flag observations that meet conditions
timer on 3
gen byte flag = 0
foreach le of local lev {
quietly replace flag = 1 if inrange(lon, minlon`le'[_N], maxlon`le'[_N]) & inrange(lat, minlat`le'[_N], maxlat`le'[_N])
}
timer off 3
* append to result database
timer on 4
quietly {
keep if flag
keep lon lat
append using "`results'"
save "`results'", replace
}
timer off 4
}
timer off 99
timer list
display "total time is " `r(t99)'/60 " minutes"
use "`results'"
browse
2) Inequalities
You ask if your inequalities are correct. They are in fact legal, meaning that Stata will not complain, but the result is probably unexpected.
The following result may seem surprising:
. display (66.11 < 100 < 67.93)
1
How is it the case that the expression evaluates to true (i.e. 1) ? Stata first evaluates 66.11 < 100 which is true, and then sees 1 < 67.93 which is also true, of course.
The intended expression was (and Stata will now do what you want):
. display (66.11 < 100) & (100 < 67.93)
0
You can also rely on the function inrange().
The following example is consistent with the previous explanation:
. display (66.11 < 100 < 0)
0
Stata sees 66.11 < 100 which is true (i.e. 1) and follows up with 1 < 0, which is false (i.e. 0).
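The same left-to-right evaluation can be reproduced step by step. The Python sketch below emulates it explicitly, because Python's own a < b < c is a chained comparison and already means (a < b) and (b < c); the two function names are invented for the illustration.

```python
def left_to_right(a, b, c):
    """Evaluate a < b < c as described above: (a < b) becomes 0 or 1,
    and that 0/1 result is then compared with c."""
    first = 1 if a < b else 0
    return 1 if first < c else 0


def intended(a, b, c):
    """What the question meant: b lies strictly between a and c."""
    return 1 if (a < b) and (b < c) else 0


print(left_to_right(66.11, 100, 67.93), intended(66.11, 100, 67.93))
print(left_to_right(66.11, 100, 0), intended(66.11, 100, 0))
```

The first line shows the two readings disagreeing on the 67.93 example; the second shows them agreeing only by coincidence.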
This uses Roberto's data setup:
clear all
set obs 1200000
set seed 13056
gen lat = runiform()*100
gen lon = runiform()*100
local sizebd `=_N' // to be used in computations
tempfile bigdata
save "`bigdata'"
*----- create example reference data -----
clear all
set obs 300
set seed 97532
gen minlat = runiform()*100
gen maxlat = minlat + runiform()*5
gen minlon = runiform()*100
gen maxlon = minlon + runiform()*5
gen id = _n
tempfile reference
save "`reference'"
timer on 1
levelsof id, local(id_list)
foreach id of local id_list {
sum minlat if id==`id', meanonly
local minlat = r(min)
sum maxlat if id==`id', meanonly
local maxlat = r(max)
sum minlon if id==`id', meanonly
local minlon = r(min)
sum maxlon if id==`id', meanonly
local maxlon = r(max)
preserve
use if (inrange(lon,`minlon',`maxlon') & inrange(lat,`minlat',`maxlat')) using "`bigdata'", clear
qui save data_`id', replace
restore
}
timer off 1
I would try to avoid preserving and restoring the "big" file; doing so is possible, but at the expense of losing the Stata format.
Using the same set up as Roberto and Dimitriy did,
set more off
use `bigdata', clear
merge 1:1 _n using `reference'
* check for data consistency:
* minlat, maxlat, minlon, maxlon are either all defined or all missing
assert inlist( mi(minlat) + mi(maxlat) + mi(minlon) + mi(maxlon), 0, 4)
* this will come in handy later
gen byte touse = 0
* set up and cycle over the reference data
count if !missing(minlat)
forvalues n=1/`=r(N)' {
replace touse = inrange(lat,minlat[`n'],maxlat[`n']) & inrange(lon,minlon[`n'],maxlon[`n'])
local thisid = id[`n']
outfile lat lon if touse using data_`thisid'.csv, replace comma
}
Time it on your machine. You could avoid touse and thisid and only have the single outfile within the cycle, but it would be less readable.
You can then infile lat lon using data_###.csv, clear later. If you really need the Stata files proper, you can convert that swarm of CSV files with
clear
local allcsv : dir . files "*.csv"
foreach f of local allcsv {
* change the filename
local dtaname = subinstr(`"`f'"',".csv",".dta",.)
infile lat lon using `"`f'"', clear
if _N>0 save `"`dtaname'"', replace
}
Time it, too. I protected the save as some of the simulated data sets were empty. I think this was faster than 1.5 min on my machine, including the conversion.