create columns dynamically using range of values - sas

I wasn't able to find a previous post that answered my query - will be happy to follow link to duplicate post.
I want to dynamically create columns based on range (in SAS) - something like below, which doesn't work
data work.test;
set sashelp.air;
format mb monyy.;
do i = 1 to 10;
mb&i = intnx('MONTH', date, -i, 'same');
end;
run;
The expected result is 10 new columns called mb1 to mb10 with the respective date-interval value.
Any help is appreciated!

You are bringing the macro language into this when you don't need it. Notice that I create an array MB and replace MB&I with an array reference. I don't have SASHELP.AIR, so I just used today(), but you get the idea.
data work.test;
*set sashelp.air;;
array mb[10];
date = today();
do i = 1 to dim(mb);
mb[i] = intnx('MONTH', date, -i, 'same');
end;
format mb: monyy.;
run;
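If SASHELP.AIR is available, the same array approach should work with the original SET statement restored. A minimal sketch, assuming SASHELP.AIR contains a numeric SAS date variable named DATE (as the question's code implies); the DROP statement is an addition to keep the loop counter out of the output:

```sas
data work.test;
  set sashelp.air;
  array mb[10];                 /* creates numeric variables mb1-mb10 */
  do i = 1 to dim(mb);
    /* date shifted back i months, keeping the same day of month */
    mb[i] = intnx('MONTH', date, -i, 'same');
  end;
  format mb: monyy.;            /* mb: matches every variable starting with mb */
  drop i;
run;
```

Each input row then carries mb1 to mb10, computed from that row's own DATE.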

Related

Apply if and do in SAS to merge a dataset

I'm trying to merge a dataset to another table (hist_dataset) by applying one condition.
The dataset that I'm trying to merge looks like this:
Label week_start date Value1 Value2
Ac 09Jan2023 13Jan2023 45 43
The logic I'm using is as follows:
If the value of the week_start column in the first record is equal to today's week + 14, then merge the dataset with the dataset that I want to append to.
If the value of the week_start column in the first record is not equal to today's week + 14, then do nothing; don't merge the data.
The code I'm using is as follows:
libname out /"path"
data dataset;
set dataset;
by week_start;
if first.week_start = intnx('week.2', today() + 14, 0, 'b') then do;
data dataset;
merge out.hist_dataset dataset;
by label, week_start, date;
end;
run;
But I'm getting 2 Errors:
117 - 185: There was 1 unclosed DO block.
161 - 185: No matching DO/SELECT statement.
Do you know how I can make the program run correctly, or do you know another way to do it?
Thanks,
I can't quite tell what you are asking, so let me guess at what you are trying to do and answer that.
Let's first make up some dataset and variable names. You have an existing dataset named OLD that has key variables LABEL, WEEK_START and DATE.
Now you have received a NEW dataset that has those same variables.
You want to first subset the NEW dataset to just those observations where the value of DATE is within 14 days of the first value of WEEK_START in the NEW dataset.
data subset ;
set new;
if _n_=1 then first_week=week_start;
retain first_week;
if date <= first_week+14 ;
run;
You then want to merge that into the OLD dataset.
data want;
merge old subset;
by label week_start date ;
run;
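If the goal is closer to the literal wording of the question (merge only when the first WEEK_START in the new data equals the start of the week two weeks from today), note that a DATA step IF/DO block cannot wrap another DATA step; the macro language can. A hedged sketch using the OLD and NEW names from above. The macro name MERGE_IF_DUE and the macro variable DO_MERGE are invented for illustration, and both datasets are assumed to be sorted by the BY variables:

```sas
%let do_merge = 0;   /* default, in case NEW has no observations */

data _null_;
  set new(obs=1);    /* look at the first record only */
  /* stores 1 if the condition holds, 0 otherwise */
  call symputx('do_merge',
               week_start = intnx('week.2', today() + 14, 0, 'b'));
run;

%macro merge_if_due;
  %if &do_merge %then %do;
    data want;
      merge old new;
      by label week_start date;
    run;
  %end;
%mend merge_if_due;

%merge_if_due
```

When the condition is false, no step is generated and nothing is merged.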

Combine one column's values into a single string

This might sound awkward but I do have a requirement to be able to concatenate all the values of a char column from a dataset, into one single string. For example:
data person;
input attribute_name $ dept $;
datalines;
John Sales
Mary Acctng
skrill Bish
;
run;
Result : test_conct = "JohnMaryskrill"
The column could vary in number of rows in the input dataset.
So, I tried the code below but it errors out when the length of the combined string (samplkey) exceeds 32K in length.
DATA RECKEYS(KEEP=test_conct);
length samplkey $32767;
do until(eod);
SET person END=EOD;
if lengthn(attribute_name) > 0 then do;
test_conct = catt(test_conct, strip(attribute_name));
end;
end;
output; stop;
run;
Can anyone suggest a better way to do this, maybe by breaking the column down into macro variables of 32K length each?
Regards
It would help a lot if you indicated what you're trying to do, but a quick method is to use SQL:
proc sql NOPRINT;
select name into :name_list separated by ""
from sashelp.class;
quit;
%put &name_list.;
As you've indicated, macro variables do have a size limit (64k characters) in most installations now. Depending on what you're doing, a better method may be to build a macro that puts the entire list wherever it needs to go dynamically, but you would need to explain the usage for anyone to suggest that option. This answers your question as posted.
Try this, using the VARCHAR() type. If you're on an older version of SAS this may not work.
data _null_;
set sashelp.class(keep = name) end=eof;
length long_var varchar(1000000);
length want $256.;
retain long_var;
long_var = catt(long_var, name);
if eof then do;
want = md5(long_var);
put want;
end;
run;

optimization of updating only missing values in the master dataset via update statement

I have a master dataset that contains missing values.
Sample looks like
Date Index1 Index2 Key
01NOV 20 . a
02NOV . 30 a
02NOV 10 20 a
I also have a update dataset that contains no missing values.
Date Index1 Index2 Key
01NOV 10 10 a
02NOV 5 40 a
The idea is, if data matches and master dataset has missing values under index then replace it with corresponding index in the update dataset. If not, preserve its value.
Output should be
Date Index1 Index2 Key
01NOV 20 10 a
02NOV 5 30 a
02NOV 10 20 a
My code is below
proc sql;
update master as a
set index1 = case when a.index1 ^= . then a.index1 else (select index1 from update as b where a.Date = b.Date and a.Key = b.Key) end,
index2 = case when a.index2 ^= . then a.index2 else (select index2 from update as b where a.Date = b.Date and a.Key = b.Key) end;
quit;
But both master and update are large. Is there a way to optimize this?
EDIT
How to update the master within a specific period? where a.Date = b.Date and a.Date between sDate and eDate?
If SQL Update is too slow, then the best way to do this is probably to create formats or a hash table, depending on your available memory and how many variables you have. SQL update will tend to be slow in cases like this, even if you have properly indexed tables.
It's probably worth giving it a try first with the SQL Update, though, with properly indexed tables.
Make sure all tables are sorted by date.
Create indexes on both tables on date.
Update one at a time.
This example is pretty quick for me - 4 minutes or so for 6.5MM/1.5MM rows where about half of the 6.5MM rows need updating - obviously 150MM rows will take longer, but the total time should scale well.
data sample;
call streaminit(7);
do key = 1 to 1000;
do date = '01JAN2011'd to '31DEC2014'd;
do _t = 1 to rand('Normal',5,2);
if rand('Uniform') < 0.8 then val1=10;
if rand('Uniform') < 0.6 then val2=20;
output;
call missing(of val1, val2);
end;
end;
end;
run;
data update_t;
do key = 1 to 1000;
do date='01JAN2011'd to '31DEC2014'd;
val1=10;
val2=20;
output;
end;
end;
run;
proc sql;
create index keydate on sample (key, date);
create index keydate on update_t (key, date);
update sample S
set val1=coalesce(val1,
(select val1 from update_t U where U.key = S.key and U.date=S.date)),
val2=coalesce(val2,
(select val2 from update_t U where U.key = S.key and U.date=S.date))
where n(s.val1,s.val2) < 2;
quit;
I use the WHERE clause to make sure that only rows with a missing val get updated, but otherwise this is pretty standard. Unfortunately SAS won't do updates with joins (it may well work the same on the back end, but you can't say update S,U set S.blah=U.blah as you can in some other SQL dialects). Note that here the SAMPLE and UPDATE_T tables are both sorted (because I created them sorted); if they weren't sorted you would need to sort both of them to get optimal behavior.
If you want a faster option, a format or hash table is your friend. I'll show the format here.
data update_formats;
set update_t;
length start $50;
start=catx('|',key,date);
label=val1;
fmtname='$VAL1F';
output;
label=val2;
fmtname='$VAL2F';
output;
if _n_=1 then do;
hlo='o';
label=' ';
start=' ';
output;
fmtname='$VAL1F';
output;
end;
run;
proc sort data=update_formats;
by fmtname;
run;
proc format cntlin=update_formats;
quit;
data sample;
modify sample;
if n(val1,val2) < 2; *where is slower for some reason;
val1=coalesce(val1,input(put(catx('|',key,date),$VAL1F.),best12.));
val2=coalesce(val2,input(put(catx('|',key,date),$VAL2F.),best12.));
run;
This uses formats to convert key+date to val1 or val2. It will tend to be faster than the SQL update unless the number of rows in the update table is very high (1.5MM should be okay; eventually, though, the format starts to slow down). The total time for this will tend to be not much higher than the write time for the table. In this case (baseline: 2 seconds to write SAMPLE initially) it took 13 seconds to load the formats and then another 13 seconds to use them and write the new SAMPLE dataset: total time under 30 seconds versus 4 minutes for the SQL update (and also not requiring index creation or sorting the bigger table).
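The hash-table route could look roughly like this. It is a sketch rather than the answerer's code: it reuses the SAMPLE and UPDATE_T datasets from the example, and the names SAMPLE_FILLED, U_VAL1 and U_VAL2 are made up for illustration. Unlike the MODIFY step, it writes a new dataset instead of updating in place, and it assumes UPDATE_T fits in memory.

```sas
data sample_filled;
  if _n_ = 1 then do;
    /* compile-time trick: put key, date, u_val1, u_val2 in the PDV */
    if 0 then set update_t(rename=(val1=u_val1 val2=u_val2));
    /* load the update table into memory, keyed by key and date */
    declare hash u(dataset: 'update_t(rename=(val1=u_val1 val2=u_val2))');
    u.defineKey('key', 'date');
    u.defineData('u_val1', 'u_val2');
    u.defineDone();
  end;
  set sample;
  /* only look up rows that have at least one missing value */
  if n(val1, val2) < 2 then do;
    if u.find() = 0 then do;          /* 0 means a matching key was found */
      val1 = coalesce(val1, u_val1);  /* keep existing value when present */
      val2 = coalesce(val2, u_val2);
    end;
  end;
  drop u_val1 u_val2;
run;
```

No sorting or indexing of SAMPLE is needed, since each lookup goes straight to the in-memory table.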

Moving average only for consecutive days in sas

I have a question regarding moving average. I use Proc Expand (cmovave 3), but those three days can be non consecutive I suppose. I want to avoid missing data between days and use moving average for just those adjacent days.
Is there any way that I can do this? If I want to put it in another way 'how can I select a part of my data set where I have values for consecutive period (days)?'. I hope you give me some examples for this problem.
Use Expand to make sure you have all the values in the timeseries interval. Then use a data step to calculate the ma3 with the lagN() functions.
If your data already has the correct timeseries interval, then skip the PROC EXPAND step.
data test;
start = "01JAN2013"d;
format date date9.
value best.;
do i=1 to 365;
r = ranuni(1);
value = rannor(1);
date = intnx('weekday',start,i);
dummy=1;
if r > .33 then output;
end;
drop i start r;
run;
proc expand data=test out=test2 to=weekday ;
id date;
var dummy;
run;
data test(drop=dummy);
merge test2 test;
by date;
ma3 = (value + lag(value) + lag2(value))/3;
run;
I use the DUMMY variable so that EXPAND will convert the series to WEEKDAY. Then drop it afterwards.
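If PROC EXPAND is not available, a variation on the same idea is to compare lagged dates directly. This is a hedged sketch, not part of the answer above: it assumes the original TEST dataset (before the PROC EXPAND and merge steps), sorted by DATE, on a weekday calendar. The output name MA_CONSEC and the helper variables D1, D2, V1, V2 are invented for illustration.

```sas
data ma_consec;
  set test;
  /* call LAG unconditionally so the queues advance on every row */
  d1 = lag(date);
  d2 = lag2(date);
  v1 = lag(value);
  v2 = lag2(value);
  /* compute the average only when the two previous rows are
     exactly the two preceding weekdays */
  if d1 = intnx('weekday', date, -1)
     and d2 = intnx('weekday', date, -2)
    then ma3 = (value + v1 + v2) / 3;
  drop d1 d2 v1 v2;
run;
```

Rows that follow a gap keep MA3 missing, so filtering on nonmissing MA3 selects the runs of consecutive days.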

SAS-How to format arrays dynamically based on information in one column

I'm new to SAS, and would greatly appreciate anyone who can help me formulate a code. Can someone please help me with formatting changing arrays based on the first column values?
So basically here's the original data:
Category Name1 Name2......... (Changes invariably)
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
I would like to format the values under Name1 to infinite Name# and reformat them to dollar10.2 for any values under Category called 'AmountBilled','AmountPaid','AmountDed'.
Thank you so much for your help!
You can't conditionally format a column (like you might in Excel). A variable/column has one format for the entire column. There are tricks to get around this, but they're invariably more complex than should be considered useful.
You can store the formatted value in a character variable, but then you lose the ability to do math on it.
data have;
input category :$10. name1 name2;
datalines;
#ofpeople 20 30
#ofproviders 10 5
#ofclaims 40 25
AmountBilled 50 100
AmountPaid 11 35
AmountDed 5 6
;;;;
run;
data want;
set have;
array names name:; *colon is wildcard (starts with);
array newnames $10 newname1-newname10; *Arbitrarily 10, can be whatever;
if substr(category,1,6)='Amount' then do;
do _t = 1 to dim(names);
newnames[_t] = put(names[_t],dollar10.2);
end;
end;
run;
You could programmatically figure out the endpoint (newname10 in this example) using PROC CONTENTS or SQL's DICTIONARY.COLUMNS / SAS's SASHELP.VCOLUMN. Alternatively, you could put out the original dataset as a three-column dataset with many rows for each category (was it this way to begin with, prior to a PROC TRANSPOSE?) and put the character variable there (not needing an array). To me that's the cleanest option.
data have_t;
set have;
array names name:;
format nameval $10.;
do namenum = 1 to dim(names);
if substr(category,1,6)='Amount' then nameval = put(names[namenum],dollar10.2 -l);
else nameval=put(names[namenum],10. -l); *left aligning here, change this if you want otherwise;
output; *now we have (namenum) rows per line. Test for missing(name) if you want only nonmissing rows output (if not every row has same number of names);
end;
run;
proc transpose data=have_t out=want_T(drop=_name_) prefix=name;
by category notsorted;
var nameval;
run;
Finally, depending on what you're actually doing with this, you may have superior options in terms of the output method. If you're doing PROC REPORT for example, you can use compute blocks to set the style (format) of the column conditionally in the report output.
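As a rough illustration of the PROC REPORT route, the compute block below switches the format per row with CALL DEFINE. It is a sketch under assumptions: it uses the HAVE dataset from above with only NAME1 and NAME2, and defines them as DISPLAY columns so they can be referenced by name; with more columns, the COLUMN statement and the CALL DEFINE calls would need to be extended or generated.

```sas
proc report data=have;
  column category name1 name2;
  define category / display;
  define name1    / display;
  define name2    / display;
  /* attached to the last column so every column to its left is available */
  compute name2;
    if substr(category, 1, 6) = 'Amount' then do;
      call define('name1', 'format', 'dollar10.2');
      call define('name2', 'format', 'dollar10.2');
    end;
  endcomp;
run;
```

The underlying variables stay numeric; only the report output changes.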