Update Dataset in SAS

Afternoon,
I have a dataset with roughly 2,500 observations, of which I want to work 200 a day. The original 2,500 might change on a daily basis as records drop out or new ones come in, so I need to join this updated list each day.
So far I can create the starting list, and I am using a DO WHILE loop to extract 200 obs that have a flag variable, called 'Worked' here (SENT in the code below), which changes from N to Y.
What I want to do is update the original list instead of creating a new one, so I have one copy from which I only pull 200 unworked obs. What's the best way to do this?
Below is what I have so far:
data work.todayslist work.store;
   set work.store;                             /* the running master list */
   retain counter;
   if sent = 'N' and counter <= 200 then do;   /* a missing counter also passes the test */
      if counter = . then counter = 1;
      counter = counter + 1;
      sent = 'Y';
      output work.todayslist;                  /* today's batch of 200 */
   end;
   output work.store;                          /* keep every obs, worked or not */
run;
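One way to keep a single copy (a sketch, not from the original thread, assuming the master is WORK.STORE and the flag is SENT): the MODIFY statement rewrites observations in place, so the master never has to be recreated.

data work.store work.todayslist;
   modify work.store;
   where sent = 'N';            /* read only unworked obs */
   if _n_ > 200 then stop;      /* quit after today's 200 */
   sent = 'Y';
   replace work.store;          /* rewrite the obs in place */
   output work.todayslist;      /* and keep a copy of today's batch */
run;

Because the master is updated in place, the daily refresh of the 2,500 can then be applied to that same copy, for example with an UPDATE or MERGE by whatever key identifies an observation.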

Related

Create a Variable to Conditionally Equal Another Cell's Value

I have a dataset with two variables, 'shift' and 'scheduled'. The 'shift' variable contains a number of different time-value records, for example "ED A 7a-4p"; the 'scheduled' variable contains the number of days that shift is scheduled, so for example there would be a "3" in the cell to represent 3 days.
I created the following code to understand how many shifts are staffed at a given hour.
data ED_A_7a_4p;
   set schedule schedule10;
   if shift = 'ED A 7a-4p' and Scheduled = '3' then SevenToEightAM = ???;
   if shift = 'ED A 7a-4p' and Scheduled = '7' then EightToNineAM = ???;
run;
I would like the created variables, for example 'SevenToEightAM', to equal the number that is in "scheduled" variable column. So if 'scheduled' is 3 I would want 'SevenToEightAM' to equal 3.
The issue is that 'scheduled' is totally random and I can't autocode it, so I was hoping there is a conditional option in SAS that would let me set 'SevenToEightAM' to whatever 'scheduled' is within my dataset.
You probably want a TABULATE report instead of creating new variables. Try:
data have;
   set original;                                /* ORIGINAL stands in for your input data */
   scheduled_num = input(scheduled, best12.);   /* SCHEDULED is character; convert it */
run;

proc tabulate data=have;
   class shift;
   var scheduled_num;
   table shift, scheduled_num*sum;              /* total scheduled days per shift */
run;
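If you do also want the per-hour variables from the question, the conditional assignment can simply copy the converted value. A minimal sketch, assuming 'scheduled' is character (as the quoted comparisons in the question suggest):

data ED_A_7a_4p;
   set schedule schedule10;
   if shift = 'ED A 7a-4p' then SevenToEightAM = input(scheduled, best12.);
run;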

SAS Looping to create dataset

I am new to SAS and need some help (yes, I've looked through everything; maybe I am just not asking it the right way, but here I am). Let's say I want to create a dataset from sashelp.cars and I want there to be 5 observations for every make:
i.e. 5 obs for Acura, 5 obs for Audi, 5 obs for BMW, etc. And I want all the data returned, but limited to 5 observations per make.
How would I do this with a loop instead of a macro? My actual dataset has 93 distinct values and I don't want to use 93 macro calls.
Thanks in advance!
Which 5 obs do you want for each make? The first 5? The last 5? Some sort of random sample?
If it's the latter, proc surveyselect is the way to go:
proc sort data = sashelp.cars out = cars;
   by make;
run;
proc surveyselect
      data = cars
      out = mysample
      method = urs    /* unrestricted random sampling: with replacement */
      n = 5
      outhits;        /* one output row per hit, so exactly 5 rows per make */
   strata make;
run;
Setting method = urs requests unrestricted random sampling, i.e. sampling with replacement. As this allows the same row to be selected multiple times, we are guaranteed 5 rows per make in the sample (the outhits option writes one output row per selection), even if there are < 5 rows for a make in the input dataset. If you just want to take all available rows in that scenario, use method = srs together with the selectall option to request simple random sampling without replacement, as in the sketch below.
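That variant, under the same setup; selectall makes strata with fewer than 5 rows contribute everything they have instead of stopping with an error:

proc surveyselect
      data = cars
      out = mysample
      method = srs    /* simple random sampling: without replacement */
      n = 5
      selectall;      /* strata with < 5 obs contribute all their rows */
   strata make;
run;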
If you want the first 5 per make, then sort as before, then use a data step:
data mysample;
   set cars;
   by make;
   if first.make then rowcount = 0;   /* reset the counter for each make */
   rowcount + 1;                      /* sum statement: rowcount is implicitly retained */
   if rowcount <= 5;                  /* keep only the first 5 per make */
run;
Getting the last 5 rows per make is very similar: if you have a key column that you can use to reverse the order within each make, that's the simplest option.
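A sketch of that, assuming a hypothetical ordering column id:

proc sort data = cars;
   by make descending id;   /* id is a hypothetical key giving the original row order */
run;

data mysample;
   set cars;
   by make;
   if first.make then rowcount = 0;
   rowcount + 1;
   if rowcount <= 5;        /* the first 5 in reversed order are the last 5 overall */
run;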

SAS - Do Loops Iterations

Can someone explain this code to me in depth? I have a list of comments in the code where I am confused. Is there any way I can attach a CSV of the data? Thanks in advance.
data have;
   infile "&sasforum.\datasets\Returns.csv" firstobs=2 dsd truncover;
   input DATE :mmddyy10. A B B_changed;
   format date yymmdd10.;
run;
data spread;
   do nb = 1 by 1 until(not missing(B));
      set have;
   end;
   br = B;
   do i = 1 to nb;
      set have; *** I don't get how you can do i = 1 to nb with set have. There is no variable nb on HAVE; the variable nb is read into the dataset SPREAD;
      if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
      else B_spread = B;
      output;
   end;
   drop nb i br;
run;
***** If I comment out "drop nb i br" I get to see that nb takes a value of 2 for the null values of B. I don't get how this is done or possible, because if I run the code right after the line "br = B", and put an output statement in the first do loop, I am clearly seeing that nb takes a value of one for B null values. Honestly, it is like the first do loop reads in future observations for B as BR. Can you please explain this to me? The second dataset "bunch" seems to follow the same type of principles as the first, so I imagine that if I get a grasp on how the dataset SPREAD is created, then I will understand how BUNCH is created.;
This is an advanced DATA step programming technique, commonly referred to as a DoW loop. If you search lexjansen.com for DoW, you will find helpful papers like http://support.sas.com/resources/papers/proceedings09/038-2009.pdf. The DoW loop codes an explicit loop around a SET statement. This is actually a "Double-DoW loop", because you have two explicit loops.
I made some sample data, and added some PUT statements to your code:
data have ;
input B ;
cards ;
.
.
1
2
.
.
.
3
;
data spread;
   do nb = 1 by 1 until(not missing(B));
      set have;
      put _n_= "top do-loop " (nb B)(=) ;
   end;
   br = B;
   do i = 1 to nb;
      set have;
      if nb > 1 then B_spread = (1+br)**(1/nb) - 1;
      else B_spread = B;
      output;
      put _n_= "bottom do-loop " (nb B br B_spread)(=) ;
   end;
   drop nb i br;
run;
With that sample data, on the first iteration of the DATA step (_N_=1), the top DO loop will iterate three times, reading the first three records of HAVE. At that point, (not missing(B)) will be true, and the loop will not iterate again. The variable NB will have a value of 3. The bottom loop will then iterate 3 times, because NB has a value of 3. It will also read the first three records of HAVE. It will compute B_spread and output each record.
On the second iteration of the DATA step, the top DO loop will iterate only once. It will read the 4th record, with B=2. The bottom loop will iterate once, reading the 4th record, computing B_spread, and output.
On the third iteration of the DATA step, the top DO loop will iterate four times, reading the 5th through 8th records. The bottom loop will also iterate four times, reading the 5th through 8th records, computing B_spread, and output.
On the fourth iteration of the DATA step, the step will complete, because the SET statement in the top loop will read the end-of-file mark.
The core concept of a Double-DoW loop is that typically you are reading the data in groups. Often groups are identified by an ID; here they are defined as the sequence of records read until not missing(B). The top DO loop reads the first group of records and computes some value (in this case NB, the number of records in the group). Then the bottom DO loop re-reads that same group and computes some new value, using the value computed in the top DO loop. In this case, the bottom DO loop computes B_spread, using NB.
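To connect this to the more common ID-grouped case, here is a minimal Double-DoW sketch (not from the thread), assuming a dataset HAVE with a group variable ID and a numeric variable X; it attaches each group's mean to every row of the group:

data want;
   /* top loop: read one whole ID group and accumulate a total;
      after the loop, N holds the number of records in the group */
   do n = 1 by 1 until (last.id);
      set have;
      by id;
      total = sum(total, x);
   end;
   groupmean = total / n;
   /* bottom loop: re-read the same N records and output them with GROUPMEAN */
   do i = 1 to n;
      set have;
      output;
   end;
   drop n i total;
run;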

Optimization of updating only missing values in the master dataset via UPDATE statement

I have a master dataset that contains missing values.
Sample looks like
Date Index1 Index2 Key
01NOV 20 . a
02NOV . 30 a
02NOV 10 20 a
I also have an update dataset that contains no missing values.
Date Index1 Index2 Key
01NOV 10 10 a
02NOV 5 40 a
The idea is: if the keys match and the master dataset has a missing value under an index, replace it with the corresponding index from the update dataset; otherwise, preserve its value.
Output should be
Date Index1 Index2 Key
01NOV 20 10 a
02NOV 5 30 a
02NOV 10 20 a
My code is below
proc sql;
   update master as a
      set index1 = case when a.index1 ^= . then a.index1
                        else (select index1 from update as b where a.Date = b.Date and a.Key = b.Key) end,
          index2 = case when a.index2 ^= . then a.index2
                        else (select index2 from update as b where a.Date = b.Date and a.Key = b.Key) end;
quit;
But both master and update are large. Is there a way to optimize this?
EDIT
How do I update the master only within a specific period, i.e. where a.Date = b.Date and a.Date between sDate and eDate?
If SQL Update is too slow, then the best way to do this is probably to create formats or a hash table, depending on your available memory and how many variables you have. SQL update will tend to be slow in cases like this, even if you have properly indexed tables.
It's probably worth giving it a try first with the SQL Update, though, with properly indexed tables.
Make sure all tables are sorted by date.
Create indexes on both tables on date.
Update one at a time.
This example is pretty quick for me - 4 minutes or so for 6.5MM/1.5MM rows where about half of the 6.5MM rows need updating - obviously 150MM rows will take longer, but the total time should scale well.
data sample;
   call streaminit(7);
   do key = 1 to 1000;
      do date = '01JAN2011'd to '31DEC2014'd;
         do _t = 1 to rand('Normal',5,2);       /* a few rows per key/date */
            if rand('Uniform') < 0.8 then val1=10;
            if rand('Uniform') < 0.6 then val2=20;
            output;
            call missing(val1, val2);           /* reset so some rows keep missing values */
         end;
      end;
   end;
run;
data update_t;
   do key = 1 to 1000;
      do date = '01JAN2011'd to '31DEC2014'd;
         val1=10;
         val2=20;
         output;
      end;
   end;
run;
proc sql;
   create index keydate on sample (key, date);
   create index keydate on update_t (key, date);
   update sample S
      set val1=coalesce(val1,
             (select val1 from update_t U where U.key = S.key and U.date=S.date)),
          val2=coalesce(val2,
             (select val2 from update_t U where U.key = S.key and U.date=S.date))
      where n(S.val1, S.val2) < 2;
quit;
I make sure that only rows with a missing val get updated via the WHERE clause, but otherwise this is pretty standard. Unfortunately SAS won't do updates with joins (it may well work the same on the back end, but you can't say update S,U set S.blah=U.blah as you can in some other SQL dialects). Note that the SAMPLE and UPDATE_T tables here are both sorted (because I created them sorted); if they weren't, you would need to sort both of them to get optimal behavior.
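To address the EDIT in the question (only updating within a period), one hedged option is to add the date restriction to the same WHERE clause; &sdate and &edate below are hypothetical macro variables resolving to date literals:

proc sql;
   update sample S
      set val1=coalesce(val1,
             (select val1 from update_t U where U.key = S.key and U.date=S.date)),
          val2=coalesce(val2,
             (select val2 from update_t U where U.key = S.key and U.date=S.date))
      where n(S.val1, S.val2) < 2
        and S.date between &sdate. and &edate.;   /* hypothetical period bounds */
quit;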
If you want a faster option, a format or hash table is your friend. I'll show the format here.
data update_formats;
   set update_t;
   length start $50;
   start=catx('|',key,date);   /* composite key|date lookup string */
   label=val1;
   fmtname='$VAL1F';
   output;
   label=val2;
   fmtname='$VAL2F';
   output;
   if _n_=1 then do;           /* add an OTHER range to each format */
      hlo='o';
      label=' ';
      start=' ';
      output;                  /* OTHER record for $VAL2F (still the current fmtname) */
      fmtname='$VAL1F';
      output;                  /* OTHER record for $VAL1F */
   end;
run;
proc sort data=update_formats;
by fmtname;
run;
proc format cntlin=update_formats;
quit;
data sample;
   modify sample;
   if n(val1,val2) < 2; *where is slower for some reason;
   val1=coalesce(val1,input(put(catx('|',key,date),$VAL1F.),best12.));
   val2=coalesce(val2,input(put(catx('|',key,date),$VAL2F.),best12.));
run;
This uses formats to convert key+date to val1 or val2. It will tend to be faster than the SQL update unless the number of rows in the update table is very high (1.5MM should be okay; eventually, though, the format starts to slow down). The total time for this will tend to be not much higher than the write time for the table. In this case (baseline: 2 seconds to write SAMPLE initially), it took 13 seconds to load the formats and another 13 seconds to use them and write the new SAMPLE dataset: total time under 30 seconds, versus 4 minutes for the SQL update (and no index creation or sorting of the bigger table required).
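For completeness, a hedged sketch of the hash-object alternative mentioned above (my own, not from the answer), assuming UPDATE_T fits in memory. UVAL1 and UVAL2 are renamed copies introduced so the lookup doesn't collide with the master's own variables:

data sample;
   if _n_ = 1 then do;   /* load the update table into memory once */
      declare hash u(dataset: "update_t(rename=(val1=uval1 val2=uval2))");
      u.defineKey('key', 'date');
      u.defineData('uval1', 'uval2');
      u.defineDone();
   end;
   if 0 then set update_t(rename=(val1=uval1 val2=uval2));  /* define host variables */
   modify sample;
   if n(val1, val2) < 2;          /* only touch rows with a missing val */
   if u.find() = 0 then do;       /* 0 = key+date found in the update table */
      val1 = coalesce(val1, uval1);
      val2 = coalesce(val2, uval2);
   end;
run;

As with the format version, only the rows that pass the subsetting IF are rewritten in place, so the cost is one pass over SAMPLE plus the one-time hash load.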

Count number of living firms efficiently

I have a list of companies with start and end dates for each. I want to count the number of companies alive over time. I have the following code but it runs slowly on my large dataset. Is there a more efficient way to do this in Stata?
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        *display date("01-`m'-`y'","DMY")
        count if start_dt <= date("01-`m'-`y'","DMY") & date("01-`m'-`y'","DMY") <= end_dt
    }
}
One way is to use the inrange() function. In Stata, date variables are just integers, so you can easily operate on them.
forvalues y = 1982/2012 {
    forvalues m = 1/12 {
        local d = date("01-`m'-`y'","DMY")
        count if inrange(`d', start_dt, end_dt)
    }
}
This alone will save you a huge amount of time. For 50,000 observations (and made-up data):
. timer list 1
1: 3.40 / 1 = 3.3980
. timer list 2
2: 18.61 / 1 = 18.6130
timer 1 is with inrange, timer 2 is your original code. Results are in seconds. Run help inrange and help timer for details.
That said, maybe someone can suggest an overall better strategy.
Assuming a firm identifier firmid, this is another way to think about the problem, but with a different data structure. Make sure you have a saved copy of your dataset before you do this.
expand 2
bysort firmid : gen eitherdate = cond(_n == 1, start_dt, end_dt)
by firmid : gen score = cond(_n == 1, 1, -1)
sort eitherdate
gen living = sum(score)
by eitherdate : replace living = living[_N]
So,
We expand each observation to 2 and put both dates in a new variable, the start date in one observation and the end date in the other observation.
We assign a score that is 1 when a firm starts and -1 when it ends.
The number of firms is increased by 1 every time a firm starts and decreased by 1 every time one ends. We just need to sort by date and the number of firms is the cumulative sum of those scores. (EDIT: There is a fix for changes on the same date.)
This new data structure could be useful for other purposes.
There is a write-up at http://www.stata-journal.com/article.html?article=dm0068
EDIT:
Notes in response to @Roberto Ferrer (and anyone else who read this):
I fixed a bad bug, which made this too difficult to understand. Sorry about that.
The dates used here are just the dates at which firms start and end. There is no evident point in evaluating the number of firms at any other date as it would just be the same number as the previous date used. If you needed, however, to interpolate to a grid of dates, copying the previous count would be sufficient.
It is important not to confuse the Stata function sum(), which returns the cumulative sum, with any egen function. The impression that egen's total() is an alternative here was a side effect of my bug.