Do Loop and Count - sas

I have a dataset which look like this:
Patient_MRN DOV1_1 DOV2_1 DOV2_2 DOV2_3 DOV2_4 ... DOV2_200
234532 01/16/2018 08/09/2018
232536 01/01/2018 05/07/2018 06/07/2018 09/13/2019
I would like to count the number of times the DOV2_ variable falls within one calendar year of the DOV1_1 variable by Patient_MRN. So for example, the count for patient '234532' would be 1 and patient '232536' the count would be 2 (05/07/2018, 06/07/2018).
Not sure if this is the most efficient way but option I could use to do this is to subtract the DOV2_ variables from DOV1_1, and keep then keep only the outputs that are less than 365, and then create a count of the variables that are less than 365.
To do this, I know I need to create some sort of doop loop or marco but I'm not quite sure where to begin

The INTCK function can count date and time intervals.
The DATA Step ARRAY statement allows you to reference variables in an array like manner;
Presume 'within one calendar year' means the interval from DOV1_1 date to next date < 1 year forward. INTCK with a 'YEAR' interval and 'C'ontinuous method will return 0 when a date is within 1 year forward.
data want;
set have;
array dates dov2_1-dov2_200;
count = 0;
do index = 1 to dim(dates);
if intck('year', dov1_1, dates(index), 'C') = 0 then count + 1;
end;
run;

Related

Deleting missing values when dealing with panel data

I am working with a panel dataset, so many countries and many variables throughout a period. The problem is that some countries have no value for certain variables across the whole period and I would like to get rid of them. I found this code for deleting rows with missing values :
DATA data0;
SET data1;
IF cmiss(of _all_) then delete;
RUN;
But all this does is check every row, while I would like to delete a whole country if it has no observations in at least one variable.
Here's a part of the data :
If you want to delete the whole country if it has any information missing, you are on the right track, you just need to add a (group) by statement.
If your data is already sorted by country, as it appears to be in the picture, you can just run:
data want;
set have;
IF cmiss(of _all_) then delete;
by country;
If it is not sorted, you need to first run:
proc sort data=have;
by country;
However, if you have 60 years of data for every country, my guess is that you will not find a single one that have all the information for every year. It will be probably better to do some substantive choices of countries and periods you want to analyze, and then perform multiple imputatiom of missing data: https://support.sas.com/rnd/app/stat/papers/multipleimputation.pdf
You can use a DOW loop to compute which variable(s) contain only missing values within a group.
A second DOW loop outputs only those groups in which all variables contain at least on value.
Example:
data have;
call streaminit (2020);
do country = 1 to 6;
do year = 1960 to 1999;
array x gini kof tradegdp fdi gdp age_dep educ;
do over x;
x = rand('integer', 20, 100);
end;
if country = 1 then call missing (gini);
if country = 2 then call missing (educ);
if country = 4 then call missing (fdi);
output;
end;
end;
run;
data want;
* count number of non-missing values over group for each arrayed variable;
do _n_ = 1 by 1 until (last.country);
set have;
by country;
array x gini kof tradegdp fdi gdp age_dep educ;
array flag(100) _temporary_; * flag if variable has a non-missing value in group;
do _index = 1 to dim(x);
if not(flag(_index)) then flag(_index) = 1 - missing(x(_index));
end;
end;
* check if at least one variable has no values;
_remove_group_flag = sum(of flag(*)) ne dim(x);
do _n_ = 1 to _n_;
set have;
if not _remove_group_flag then output;
end;
call missing (of flag(*));
run;
Will LOG
NOTE: There were 240 observations read from the data set WORK.HAVE. First DOW loop
NOTE: There were 240 observations read from the data set WORK.HAVE. Second DOW loop
NOTE: The data set WORK.WANT has 120 observations and 11 variables. Conditional output

Select an observation if it has another within 24 hours of it

I am trying to create a table that only populates entries of a contact to a customer at a business number if they were NOT first contacted at a home number within 24 hours prior to the attempt at the business number.
So if I have
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
I want to be able to get
1 20MAY2018:06:24:28 B
2 24MAY2018:06:24:28 B
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
I have tried adding a count to the ID but I'm not sure how I'd go about using that, or if there's a way to use a subquery within a proc sql to create a count of observations that have more than one in a 24 hour period.
So, your approach will work, but will be quite messy with large numbers - as you're doing a cartesian join within ID. If each ID has few records it's not so bad, but if each ID has many records you make a lot of connections.
Fortunately, there's an easy way to do this in SAS!
data want;
do _n_ = 1 by 1 until (last.id); *for each ID:;
set have;
by id;
if first.id then last_home=0; *initialize last_home to 0;
if type='H' then last_home = record; *if it is a home then save it aside;
if type='B' and intck('Hour',last_home,record,'c') gt 24 then output; *if it is business then check if 24 hours have passed;
end;
format last_home datetime.;
run;
A few notes:
I use a DoW loop, but that really isn't mandatory, I just like it from a clarity perspective (it makes it clear I'm doing something at an ID-repetition level). You could remove that loop and add a RETAIN for last_home and it would be the same.
I use INTCK instead of INTNX - again this is for clarity, your INTNX is fine too, but INTCK just does the comparison, while INTNX is for advancing dates by an amount. I use the one that matches what I am trying to do, so someone reading the code can see easily what I'm doing.
This will be much faster than SQL on larger datasets, if for no other reason than it only passes the data once. SQL will necessarily do it multiple times, even if you don't separate HAVEA/HAVEB and do that within the SQL query.
I believe I figured it out!
I have HAVEA and HAVEB tables hosting type H and type B entries respectively.
Then I ran the following PROC SQL's.
PROC SQL;
CREATE TABLE WANTA AS
SELECT A.RECORD AS PREVIOUS_CALL, B.* FROM HAVEB B
JOIN HAVEA A ON (B.ID=A.ID AND A.RECORD LE B.RECORD);
CREATE TABLE WANTB AS
SELECT * FROM WANTA
GROUP BY ID, RECORD
HAVING PREVIOUS_CALL = MAX(PREVIOUS_CALL);
CREATE TABLE WANTC AS
SELECT * FROM WANTB
WHERE INTNX('HOUR',RECORD,-24,'SAME') GT PREVIOUS_CALL;
QUIT;
Please let me know if this is not a sustainable answer for larger sums of data or if there is a much better method of approaching this.
You perform a selection to get the final result set with out creating intermediate tables. Here are two alternatives:
First way
Similar to your 'figuring it out'. A reflexive join with grouping detects the "to_home" calls prior to the "to_business" calls that did NOT occur in the last 24 hours (86,400 seconds)
proc sql;
create table want as
select distinct
business.*
from have as business
join have as home
on business.id = home.id
& business.type = 'B'
& home.type = 'H'
& home.CALL_DT < business.CALL_DT
group by
business.call_dt
having
max(home.call_dt) < business.call_dt - 86400
;
Second way
Perform a NOT existential check, for a to_home call in prior 24hr, for every to_business call.
create table want2 as
select
business.*
from
have as business
where
business.type = 'B'
and
not exists (
select * from have as home
where home.id = business.id
and home.type = 'H'
and home.call_dt < business.call_dt
and home.call_dt >= business.call_dt - 86400
)
;
A HASH solution does have some dependencies (amount of data and RAM)...but it is another alternative
DATA HAVE;
INPUT ID RECORD DATETIME. TYPE $;
FORMAT RECORD DATETIME.;
CARDS;
1 17MAY2018:06:24:28 H
1 18MAY2018:05:24:28 B
1 20MAY2018:06:24:28 B
2 20MAY2018:07:24:28 H
2 20MAY2018:08:24:28 B
2 22MAY2018:06:24:28 H
2 24MAY2018:06:24:28 B
3 25MAY2018:06:24:28 H
3 25MAY2018:07:24:28 B
3 25MAY2018:08:24:28 B
4 26MAY2018:06:24:28 H
4 26MAY2018:07:24:28 B
4 27MAY2018:08:24:28 H
4 27MAY2018:09:24:28 B
5 28MAY2018:06:24:28 H
5 29MAY2018:07:24:28 B
5 29MAY2018:08:24:28 B
;
RUN;
/* Keep only HOME TYPE records and
rename RECORD for using in comparision */
Data HOME(Keep=ID RECORD rename=(record=hrecord));
Set HAVE(where=(Type="H"));
Run;
Data WANT(Keep=ID RECORD TYPE);
/* Use only BUSINESS TYPE records */
Set HAVE(where=(Type="B"));
/* Set up HASH object */
If _N_=1 Then Do;
/* Multidata:YES for looping through
all successful FINDs */
Declare HASH HOME(dataset:"HOME", multidata:'yes');
home.DEFINEKEY('id');
home.DEFINEDATA('hrecord');
home.DEFINEDONE();
/* To prevent warnings in the log */
Call Missing(HRECORD);
End;
/* FIND first KEY match */
rc=home.FIND();
/* Successful FINDs result in RC=0 */
Do While (RC=0);
/* This will keep the result of the most recent, in datetime,
HOME/BUS record comparision */
If intck('Hour',hrecord,record,'c') > 24 Then Good_For_Output=1;
Else Good_For_Output=0;
/* Keep comparing HOME/BUS for all HOME records */
rc=home.FIND_NEXT();
End;
If Good_For_Output=1 Then Output;
Run;

SAS Macro error with call execute

I have the following code that is being used generate running totals of features for the past 1 day, 7 days, 1 month, 3 months, and 6 months.
LIBNAME A "C:\Users\James\Desktop\data\Base Data";
LIBNAME DATA "C:\Users\James\Desktop\data\Data1";
%MACRO HELPER(P);
data a1;
set data.final_master_&P. ;
QUERY = '%TEST('||STRIP(DATETIME)||','||STRIP(PARTICIPANT)||');';
CALL EXECUTE(QUERY);
run;
%MEND;
%MACRO TEST(TIME,PAR);
proc sql;
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_24, :APP_2_24, :APP_3_24, :APP_4_24, :APP_5_24
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24) AND &TIME.;
/* 7 Days */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_7DAY, :APP_2_7DAY, :APP_3_7DAY, :APP_4_7DAY, :APP_5_7DAY
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7) AND &TIME.;
/* One Month */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_1MONTH, :APP_2_1MONTH, :APP_3_1MONTH, :APP_4_1MONTH, :APP_5_1MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4) AND &TIME.;
/* Three Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_3MONTH, :APP_2_3MONTH, :APP_3_3MONTH, :APP_4_3MONTH, :APP_5_3MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*3) AND &TIME.;
/* Six Months */
select SUM(APP_1), SUM(APP_2), sum(APP_3), SUM(APP_4), SUM(APP_5) INTO :APP_1_6MONTH, :APP_2_6MONTH, :APP_3_6MONTH, :APP_4_6MONTH, :APP_5_6MONTH
FROM A1
WHERE DATETIME BETWEEN INTNX('SECONDS',&TIME.,-60*60*24*7*4*6) AND &TIME.;
quit;
DATA T;
PARTICIPANT = &PAR.;
DATETIME = &TIME;
APP_1_24 = &APP_1_24.;
APP_2_24 = &APP_2_24.;
APP_3_24 = &APP_3_24.;
APP_4_24 = &APP_4_24.;
APP_5_24 = &APP_5_24.;
APP_1_7DAY = &APP_1_7DAY.;
APP_2_7DAY = &APP_2_7DAY.;
APP_3_7DAY = &APP_3_7DAY.;
APP_4_7DAY = &APP_4_7DAY.;
APP_5_7DAY = &APP_5_7DAY.;
APP_1_1MONTH = &APP_1_1MONTH.;
APP_2_1MONTH = &APP_2_1MONTH.;
APP_3_1MONTH = &APP_3_1MONTH.;
APP_4_1MONTH = &APP_4_1MONTH.;
APP_5_1MONTH = &APP_5_1MONTH.;
APP_1_3MONTH = &APP_1_3MONTH.;
APP_2_3MONTH = &APP_2_3MONTH.;
APP_3_3MONTH = &APP_3_3MONTH.;
APP_4_3MONTH = &APP_4_3MONTH.;
APP_5_3MONTH = &APP_5_3MONTH.;
APP_1_6MONTH = &APP_1_6MONTH.;
APP_2_6MONTH = &APP_2_6MONTH.;
APP_3_6MONTH = &APP_3_6MONTH.;
APP_4_6MONTH = &APP_4_6MONTH.;
APP_5_6MONTH = &APP_5_6MONTH.;
FORMAT DATETIME DATETIME.;
RUN;
PROC APPEND BASE=DATA.FLAGS_&par. DATA=T;
RUN;
%MEND;
%helper(1);
This code runs perfectly if I limit the number of observations in the %helper macro, using an (obs=) in the creation of the a1 dataset. However, when I put no limit on the obs number, i.e. execute the %test macro for every row in the dataset a1, I get errors. In SAS EG, I get a "server disconnected" popup after the status bar hangs at "running data step", and on Base SAS 9.4 I get the error that none of the macro variables have been resolved that are created in the proc sql into.
I'm confused as the code works fine for a limited amount of observations, but when trying on the whole dataset it hangs or gives errors. The dataset I'm doing this for has around 130,000 observations.
The answer to your actual question is that you're simply generating too much macro code and perhaps even simply taking too much time. The way you are doing this is going to operate on an O=n^2 level, as you're basically doing a cartesian join of every record to every record, and then some. 130,000 * 130,000 is a pretty decent sized number, and on top of that you're actually opening the SQL environment several times for each 130,000 rows. Ouch.
The solution is to do it either in a way that isn't too slow, or if it is, in a way that won't have too much overhead at least.
The fast solution is to not do the cartesian join, or to limit how much needs to be joined. One good solution would be to restructure the problem, not require every record to be compared, but instead consider each calendar day, say, a period, especially in the over-24h periods (24h you might do the way you do above, but not the other four). 1 month, 3 month, etc., do you really need to figure out time of day? Probably won't make much difference. If you can get rid of that, then you can use built in PROCs to precompile all possible 1 month periods, all possible 3 month periods, etc., and then join on the appropriate one. But that won't work with 130,000 of them; it would only work if you could limit it to one per day, probably.
If you must do it at the second level (or worse), what you'll want to do is avoid the cartesian join, and instead keep track of the various records you've seen already, and the sums. The short explanation of the algorithm is:
For each row:
Add this row's values to the rolling sums (at the end of the queue)
Check if the current item of the queue is outside of the period; if it is, subtract it from the rolling sums, and check the next item (repeat until not outside of the period), updating the current queue position
Return the sum at this point
This requires checking each row typically twice (except at odd boundaries where you have no rows popped off for several iterations, due to months having different numbers of days). This operates on O=n time, much faster than the cartesian join, and on top of that has far less memory/space required (the cartesian join might need to hit disk space).
The hash version of this solution is below. This will be the fastest solution I think that compares every row. Note that I intentionally make the test data have 1 for every row and same number of rows for every day; that lets you see how it works on a row-wise manner very easily. (For example, every 24h period has 481 rows, because I made 480 rows per day exactly, and 481 includes the same time yesterday - if you change lt to le it will be 480, if you prefer not to include same time yesterday). You can see that the 'month' based periods will have slightly odd results at the boundaries where months change because the '01FEB20xx' to '01MAY20xx' period has far fewer days (and thus rows) than the '01JUL20xx' to '01OCT20xx' period, for example; better would be 30/90/180 day periods.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add;
%macro subtract(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract;
%macro process_array_add(array=);
array app_&array. app_&array._1-app_&array._5;
%add(array=app_&array.);
%mend process_array_add;
%macro process_array_subtract(array=, period=, number=);
if _n_ eq 1 then do;
declare hiter hi_&array.('td');
rc_&array. = hi_&array..first();
end;
else do;
rc_&array. = hi_&array..setcur(key:firstval_&array.);
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and rc_&array.=0);
%subtract(array=app_&array.);
rc_&array. = hi_&array..next();
end;
retain firstval_&array.;
firstval_&array. = dt_var;
%mend process_array_subtract;
data want;
set test_data;
* if _n_ > 10000 then stop;
curr_dt_var = dt_var;
array app[5] app_1-app_5;
if _n_ eq 1 then do;
declare hash td(ordered:'a');
td.defineKey('dt_var');
td.defineData('dt_var','app_1','app_2','app_3','app_4','app_5');
td.defineDone();
end;
rc_a = td.add();
*start macro territory;
%process_array_add(array=24h);
%process_array_add(array=1wk);
%process_array_add(array=1mo);
%process_array_add(array=3mo);
%process_array_add(array=6mo);
%process_array_subtract(array=24h,period=DTDay, number=1);
%process_array_subtract(array=1wk,period=DTDay, number=7);
%process_array_subtract(array=1mo,period=DTMonth, number=1);
%process_array_subtract(array=3mo,period=DTMonth, number=3);
%process_array_subtract(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var rc: _:;
output;
run;
Here's a pure data step non-hash version. On my machine it's actually faster than the hash solution; I suspect it's not actually faster on a machine with a HDD (I have an SSD, so point access is not substantially slower than hash access, and I avoid having to load the hash). I would recommend using it if you don't know hashes very well or at all, as it'll be easier to troubleshoot, and it scales similarly. For most rows it accesses 11 rows, the current row and five other rows twice (one row, subtract it, then another row) for a total of around a million and a half total reads for 130k rows. (Compare that to about 17 billion reads for the cartesian...)
I suffix the macros with "_2" to differentiate them from the macros in the hash solution.
data test_data;
array app[5] app_1-app_5;
do _i = 1 to 130000;
dt_var = datetime() - _i*180;
do _j = 1 to dim(app);
*app[_j] = floor(rand('Uniform')*6); *generate 0 to 5 integer;
app[_j]=1;
end;
output;
end;
format dt_var datetime17.;
run;
proc sort data=test_data;
by dt_var;
run;
%macro add_2(array=);
do _i = 1 to dim(app);
&array.[_i] + app[_i];
end;
%mend add;
%macro subtract_2(array=);
do _i = 1 to dim(app);
&array.[_i] + (-1*app[_i]);
end;
%mend subtract;
%macro process_array_add_2(array=);
array app_&array. app_&array._1-app_&array._5; *define array;
%add_2(array=app_&array.); *add current row to array;
%mend process_array_add_2;
%macro process_array_sub_2(array=, period=, number=);
if _n_ eq 1 then do; *initialize point variable;
point_&array. = 1;
end;
else do; *do not have to do this _n_=1 as we only have that row;
set test_data point=point_&array.; *set the row that we may be subtracting;
end;
do while (intnx("&period.",dt_var,&number.,'s') lt curr_dt_var and point_&array. < _N_); *until we hit a row that is within the period...;
%subtract_2(array=app_&array.); *subtract the rows values;
point_&array. + 1; *increment the point to look at;
set test_data point=point_&array.; *set the new row;
end;
%mend process_array_sub_2;
data want;
set test_data;
*if _n_ > 10000 then stop; *useful for testing if you want to check time to execute;
curr_dt_var = dt_var; *save dt_var value from originally set record;
array app[5] app_1-app_5; *base array;
*start macro territory;
%process_array_add_2(array=24h); *have to do all of these adds before we start subtracting;
%process_array_add_2(array=1wk); *otherwise we have the wrong record values;
%process_array_add_2(array=1mo);
%process_array_add_2(array=3mo);
%process_array_add_2(array=6mo);
%process_array_sub_2(array=24h,period=DTDay, number=1); *now start checking to subtract what we need to;
%process_array_sub_2(array=1wk,period=DTDay, number=7);
%process_array_sub_2(array=1mo,period=DTMonth, number=1);
%process_array_sub_2(array=3mo,period=DTMonth, number=3);
%process_array_sub_2(array=6mo,period=DTMonth, number=6);
*end macro territory;
rename curr_dt_var=dt_var;
format curr_dt_var datetime21.3;
drop dt_var _:;
output; *unneeded in this version but left for comparison to hash;
run;

SAS for subsetting data

Consider following exemplary SAS dataset with following layout.
Price Num_items
100 10
120 15
130 20
140 25
150 30
I want to group them into 4 categories by defining a new variable called cat such that the new dataset looks as follows:
Price Num_items Cat
100 10 1
120 15 1
130 20 2
140 25 3
150 30 4
Also I want to group them so that they have about equal number of items (For example in above grouping Group 1 has 25, Group 2 has 20 ,Group 3 has 25 and Group 4 has 30 observations). Note that the price column is sorted in ascending order (that is required).
I am struggling to start with SAS for above. So any help would be appreciated. I am not looking for a complete solution but pointers towards preparing a solution would help.
Cool problem, subtly complex. I agree with #J_Lard that a data step with some retainment would likely be the quickest way to accomplish this. If I understand your problem correctly, I think the code below would give you some ideas as to how you want to solve it. Note that depending on the num_items, and group_target, your mileage will vary.
Generate similar, but larger data set.
data have;
do price=50 to 250 by 10;
/*Seed is `_N_` so we'll see the same random item count.*/
num_items = ceil(ranuni(_N_)*10)*5;
output;
end;
run;
Categorize.
/*Desired group size specification.*/
%let group_target = 50;
data want;
set have;
/*The first record, initialize `cat` and `cat_num_items` to 1 with implicit retainment*/
if _N_=1 then do;
cat + 1;
cat_num_items + num_items;
end;
else do;
/*If the item count for a new price puts the category count above the target, apply logic.*/
if cat_num_items + num_items > &group_target. then do;
/*If placing the item into a new category puts the current cat count closer to the `group_target` than would keeping it, then put into new category.*/
if abs(&group_target. - cat_num_items) < abs(&group_target. - (cat_num_items+num_items)) then do;
cat+1;
cat_num_items = num_items;
end;
/*Otherwise keep it in the currnet category and increment category count.*/
else cat_num_items + num_items;
end;
/*Otherwise keep the item count in the current category and increment category count.*/
else cat_num_items + num_items;
end;
drop cat_num_items;
run;
Check.
proc sql;
create table check_want as
select cat,
sum(num_items) as cat_count
from want
group by cat;
quit;

SAS - aggregating daily content with same day occurrences

I have daily data, not completely consecutive (i.e., not all days are present in a week) and I need to convert it to weekly totals. The catch is that the data pertains to transactions such that there are multiple observations with the same day. Using the following PROC EXPAND procedure results in an error "The value of the ID variable, FixtureDate=04JAN2011, at observation number 2 in data set RAW.VLCC2011 is the same as the previous observation":
PROC EXPAND DATA = raw.VLCC2011 OUT = raw.VLCC2011_wkly FROM= Day TO = Week;
convert FixtureCargoSize/ OBSERVED=TOTAL method=aggregate;
ID FixtureDate;
run;
Here's a solution with proc sql. You could also do something similar with a data step.
proc sql;
create table VLCC2011_wkly as
select intnx('week', date, 0, 'end') as week, sum(FixtureCargoSize) as FixtureCargoSizeTotal
from VLCC2011
group by calculated week;
quit;
The intnx function takes a date and moves it to some other date. In this case, it takes any date and moves it to the last day of the week. Summing over all dates that have the same end-of-week date in this way will give you want you want.
I've not used PROC EXPAND. However the error message tells you that it doesn't like that there are multiple observations per ID value. Maybe you need to pre-process the input data set RAW.VLCC2011 such that there is at most one observation per FIXTUREDATE.
Here is how I solved it (the long way):
data raw.VLCC2011_wkly;
set raw.VLCC2011;
IF FixtureDay < 8 then FixtureWeek = 1;
IF FixtureDay > 7 and FixtureDay < 15 then FixtureWeek = 2;
IF FixtureDay > 14 and FixtureDay < 23 then FixtureWeek = 3;
IF FixtureDay > 22 and FixtureDay < 30 then FixtureWeek = 4;
IF FixtureDay > 29 and FixtureDay < 32 then FixtureWeek = 5;
run;
proc sql;
create table raw.VLCC2011_wkly1 as
select FixtureMonth, FixtureDay, FixtureWeek, FixtureCargoSize, sum(FixtureCargoSize) as CargoSizeTotal
from raw.VLCC2011_wkly
group by FixtureMonth, FixtureWeek
Order by FixtureMonth, FixtureWeek, FixtureDay;
quit;