Decreasing the values of operations in COMPCOST increases gedscore? - SAS

I'm afraid I'm running into the following:
Method 1:
proc sql;
create table ... as select
...
compged(a.plan_id, b.plan_id,&maxscore.,'iL') as gedscore
from view_a a, view_b b
where a.state = b.state and calculated gedscore < &maxscore.
order by calculated gedscore;
quit;
This works, it's all fine and dandy, but I would like to adjust my results slightly with compcost. So I adopt Method 2:
proc sql;
create view tempview as select
...
from view_a a, view_b b
where a.state = b.state;
quit;
data modified_gedscore;
set tempview;
if _N_ = 1 then call compcost('delete=',10,'truncate=',10);
gedscore = compged(el_plan, clms_plan,&maxscore.,'iL');
if gedscore < &maxscore.;
run;
There's a bit more to it, but I've tried to isolate the relevant bits. I lowered the cost of the DELETE and TRUNCATE operations (which makes sense given the data I'm working with and what I'm trying to accomplish). My expected result was that, with DELETE and TRUNCATE now cheaper, more observations would have a gedscore < &maxscore. Instead, the CALL COMPCOST is actually dramatically decreasing the number of observations I see. Do I have a basic misunderstanding of how CALL COMPCOST works? If the above is incorrect, how would I adjust COMPGED so that deletion of characters is more likely to fall under the &maxscore. threshold?
Edit: Also, I understand that the different structure of the two methods raises the possibility that something other than CALL COMPCOST is causing the unexpected results, but if I simply comment out the CALL COMPCOST line I get results equivalent to those in Method 1. So that's not it.
Edit2: sample data. The first observation scores the same under both methods (0). The second yields a higher gedscore under Method 2 than under Method 1, even though the COMPCOST of DELETE and TRUNCATE has been lowered, with no other changes.
data sample_data;
input state1 $ plan1 $ plan2 $;
datalines;
ID DENTAL DENTAL
GA GBHC GBCH
;
Edit3: I think I may have found the problem. It appears that the default COMPGED costs differ from the default COMPCOST costs (both sets of defaults are listed in the SAS documentation). When COMPCOST is called, any operation not specified is set to the COMPCOST default, which is usually higher. If anybody feels like confirming, feel free.
Thanks for your help

The issue is that COMPGED is no longer using the SWAP cost, but is instead using only DELETE and INSERT (the latter of which costs 100). That's because of how CALL COMPCOST works: for some reason (that makes little sense to me), CALL COMPCOST's default values are not equal to COMPGED's default values, and it assigns its own default to every operation that you do not specify.
To make this work, it looks like you'll have to specify a value for every operation that you want COMPGED to use, in particular APPEND, BLANK, PUNCTUATION, SINGLE, SWAP, and TRUNCATE (the last of which you already specify). From the documentation, as of 9.2, the COMPGED defaults for those were 50, 10, 30, 20, 20, and 10, respectively.
In your example:
data sample_data;
input state1 $ plan1 $ plan2 $;
call compcost('delete=',10,'truncate=',10,'swap=',20);
compged_1 = compged(plan1,plan2,'il');
put compged_1=;
datalines;
ID DENTAL DENTAL
GA GBHC GBCH
;
run;
The second observation now returns 20 instead of 110 (the first, an exact match, still returns 0).
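Applied back to the original Method 2 data step, that might look like the following sketch; only DELETE and TRUNCATE are actually lowered, while the other operations listed above are pinned to the COMPGED defaults quoted from the 9.2 documentation, so that CALL COMPCOST cannot silently substitute its own (higher) values:
data modified_gedscore;
    set tempview;
    if _N_ = 1 then call compcost(
        'delete=', 10, 'truncate=', 10,      /* the two costs being lowered */
        'append=', 50, 'blank=', 10,         /* remaining COMPGED defaults (9.2) */
        'punctuation=', 30, 'single=', 20,
        'swap=', 20);
    gedscore = compged(el_plan, clms_plan, &maxscore., 'iL');
    if gedscore < &maxscore.;
run;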

Related

How to choose indexed assignment variable dynamically in SAS?

I am trying to build a custom transformation in SAS DI. This transformation will "act" on columns in an input data set, producing the desired output. For simplicity let's assume the transformation will use input_col1 to compute output_col1, input_col2 to compute output_col2, and so on up to some specified number of columns to act on (let's say 2).
In the Code Options section of the custom transformation users are able to specify (via prompts) the names of the columns to be acted on; for example, a user could specify that input_col1 should refer to the column named "order_datetime" in the input dataset, and either make a similar specification for input_col2 or else leave that prompt blank.
Here is the code I am using to generate the output for the custom transformation:
data cust_trans;
set &_INPUT0;
i=1;
do while(i<3);
call symputx('index',i);
result = myfunc("&&input_col&index");
output_col&index = result; /*what is proper syntax here?*/
i = i+1;
end;
run;
Here myfunc refers to a custom function I made using PROC FCMP, which works fine.
The custom transformation works fine if I do not try to take into account the variable number of input columns to act on (i.e. if I use "&&input_col&i" instead of "&&input_col&index" and just use the column result on the output table).
However, I'm having two issues with trying to make the approach more dynamic:
I get the following warning on the line containing
result = myfunc("&&input_col&index"):
WARNING: Apparent symbolic reference INDEX not resolved.
I do not know how to have the assignment to the desired output column happen dynamically; i.e., depending on the iteration of the do loop I'd like to assign the output value to the corresponding output column.
I feel confident that the solution to this must be well known amongst experts, but I cannot find anything explaining how to do this.
Any help is greatly appreciated!
You can't use macro variables that depend on DATA step variables in this manner: macro variables are resolved at compile time, not at run time.
So you either have to
%do i = 1 %to .. ;
which is fine if you're in a macro (it won't work outside of an actual macro), or you need to use an array.
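If the macro route fits your setup, a minimal sketch along these lines might work (the %cust_trans name and ncols= parameter are hypothetical, a blank second prompt is not handled, and the quoted call to myfunc is kept from your question):
%macro cust_trans(ncols=2);
    data cust_trans;
        set &_INPUT0;
        %do i = 1 %to &ncols;
            /* &&input_col&i resolves while the step is compiled, so each
               iteration of the %do loop writes out its own assignment */
            output_col&i = myfunc("&&input_col&i");
        %end;
    run;
%mend cust_trans;

%cust_trans(ncols=2)
The array version, which works outside of a macro as well, looks like this: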
data cust_trans;
set &_INPUT0;
array in[2] &input_col1 &input_col2; *or however you determine the input columns;
array output_col[2]; *automatically names the results;
do i = 1 to dim(in);
result = myfunc(in[i]); *You quote the input - I cannot see what your function is doing, but it is probably wrong to do so;
output_col[i] = result; /*the array element takes the place of output_col&index*/
end;
run;
That's the way you'd normally do that. I don't know what myfunc does, and I also don't know why you quote "&&input_col&index." when you pass it in; that would be a strange way to operate unless you want the name of the input column as text (and don't care what data is in that variable). If that is what you want, then pass vname(in[i]), which passes the name of the variable as a character value.
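If the column name as text really is what myfunc expects, a minimal variation of the array step above would pass vname(in[i]) rather than a quoted macro variable (still assuming two input prompts):
data cust_trans;
    set &_INPUT0;
    array in[2] &input_col1 &input_col2;
    array output_col[2];
    do i = 1 to dim(in);
        /* vname() returns the array element's variable name as a character string */
        output_col[i] = myfunc(vname(in[i]));
    end;
run;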

first and last order SAS

I have two lines of data:
Order
17/01/2016
01/02/2014
Basically I want to run logic like so:
data A.test_active;
set A.Weekly_Email_files_cleaned4;
length active :8.;
length inactive :8.;
if first.Order between '01Jan2014'd and '31Dec2015'd then active= 1;
if last.order between '01Jan2014'd and '31Dec2015'd then inactive= 1;
run;
The field "Order" has the DDMMYY10. format, according to the file properties, but I keep getting this error:
ERROR 388-185: Expecting an arithmetic operator.
Can anyone help or suggest something different in the same vein?
In SAS, between is only valid in SQL contexts: either actual PROC SQL, or WHERE statements, generally. It is not otherwise valid in SAS. You would use in (firstval:lastval) instead, if those values are integers (dates are). If they're not integers, you need to use if firstval le val le lastval or similar (can also use ge/lt/gt/>/< or whatever you like, depending on the ordering of things).
Second, first.order and last.order are boolean values - 1 or 0, nothing else, that indicate that you are on a row that is the first row for a new value when sorted by that variable, or the last row similarly. You also must have a by statement by that variable if you're going to use them.
Third, your length statements are wrong; you're confusing a few different things here, I think. Length statements for numerics aren't needed if you're using the default length of 8, but if you do like having them anyway, you need:
length active 8;
No colon and no period; both of those are used for different purposes.
ID first_order Order
alex 01/01/2013 23/01/2015
alex 01/01/2013 23/01/2015
alex 01/01/2013 03/04/2013
Basically, if an order exists after the first order that is within a certain timeframe (within a year of the date of the first order), then the user is "active".
Any ideas much appreciated, thanks.
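Putting the three corrections above together with the ID/Order sample just shown, a sketch of the step might look like this (it assumes Order is a numeric SAS date, that active/inactive are judged per ID, and orders_sorted is just a hypothetical intermediate dataset name):
proc sort data=A.Weekly_Email_files_cleaned4 out=orders_sorted;
    by ID Order;
run;

data A.test_active;
    set orders_sorted;
    by ID Order;
    length active inactive 8;
    /* the first./last. flags require the BY statement above */
    if first.ID and '01Jan2014'd le Order le '31Dec2015'd then active = 1;
    if last.ID and '01Jan2014'd le Order le '31Dec2015'd then inactive = 1;
run;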

Stata: estimating monthly weighted mean for portfolio

I have been struggling to write efficient code to estimate the monthly weighted mean of portfolio returns.
I have following variables:
firm stock returns (ret)
month1, year1 and date
portfolio (port1): this defines the portfolio of the firm stock returns
market capitalisation (mcap): to estimate weights (by month1 year1 port1)
I want to calculate weighted returns for each month and portfolio, weighted by the market capitalisation (mcap) of each firm.
I have written the following code, which works without fail but takes ages and is highly inefficient:
foreach x in 11 12 13 21 22 23 {
display `x'
forvalues y = 1980/2010 {
display `y'
forvalues m = 1/12 {
display `m'
tempvar tmp_wt tmp_tm tmp_p
egen `tmp_tm' = total(mcap) if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_wt' = mcap/`tmp_tm' if month1==`m' & year1==`y' & port1 ==`x'
gen `tmp_p' = ret*`tmp_wt' if month1==`m' & year1==`y' & port1 ==`x'
gen port_ret_`m'_`y'_`x' = `tmp_p'
}
}
}
The data look as shown in the question's attached screenshot (value-weighted portfolio return data).
This does appear to be a textbook example of how to do things as slowly as possible, except that naturally you are not doing that on purpose. All it lacks is a loop over observations to calculate totals. So, the good news is that you should indeed be able to speed this up.
It seems to boil down to
gen double wanted = .
bysort port1 year1 month1 : replace wanted = sum(mcap)
by port1 year1 month1 : replace wanted = (mcap * ret) / wanted[_N]
Principle. To get a sum in a single scalar, use summarize, meanonly rather than using egen, total() to put that scalar into a variable repeatedly, but use sum() with by: to get group sums into a variable when that is what you need, as here. sum() returns cumulative sums, so you want the last value of the cumulative sum.
Principle. Loops (here using foreach) are not needed when a groupwise calculation can be done under the aegis of by:. That is a powerful construct which Stata programmers need to learn.
Principle. Creating lots of temporary variables, here 6 * 31 * 12 * 3 = 6696 of them, is going to slow things down and use more memory than is needed. Each time you execute tempvar and follow with generate commands, there are three more temporary variables, each the size of a column in the dataset (that's what a variable is in Stata), but once they are used they are just left in memory and never looked at again. It's a subtlety of temporary variables that tempvar assigns a new name every time, but it should be clear that generate creates a new variable every time; generate will never overwrite an existing variable. The temporary variables would all be dropped at the end of a program, but by the end of that program you are holding a lot of stuff unnecessarily, possibly about a thousand times the size of the dataset. If that temporarily expanded dataset cannot all fit in memory, you will slow Stata to a crawl.
Principle. Using if obliges Stata to check each observation in turn; in this case most are irrelevant to the particular intersection of loops being executed and you make Stata check almost all of the data set (a fraction of 2231/2232, almost 1) irrelevantly while doing each particular calculation for 1/2232 of the dataset. If you have more years, or more portfolios, the fraction looked at irrelevantly is even higher.
In essence, Stata will obey your instructions (and not try any kind of optimization -- your code is interpreted utterly literally) but by: would give the cross-combinations much more rapidly.
Note. I don't know how big or how close to zero these numbers will get, so I gave you a double. For all I know, a float would work fine for you.
Comment. I guess you are being influenced by coding experience in other languages where creating variables means something akin to x = 42 to hold a constant. You could do that in Stata too, with scalars or local or global macros, not to mention Mata. Remember that a new variable in Stata is an entire new column in the dataset, regardless of whether it is holding a constant or different values in each observation. You will get what you ask for, but it is more like getting an array every time. Again, it seems that you want as an end result just one new variable, and you do not in fact need to create any others temporarily at all.

SAS Do-Loop and IF Statement to compare Current and previous row values

I'm pretty new to DO loops in SAS, and I realize I am trying to make this loop work like a MATLAB script. I haven't found many helpful tips online, as most of the do-loop examples are just for calculations, not for checking whether the row before the current one has the same value.
Here is my issue that I need to solve:
I want to look at each policy numbers below and see if the one before is the same, if it is, I want to flag it.
Policy
26X0118907
26X0375309
26X0375309
26X0527509
I would consider i=1 to be the first policy (26X0118907) and i=2 to be the second policy (26X0375309).
In this case, according to the (non-working) code below, this increment would be flagged as 'B'. Do you know how to properly code a situation like this?
data AF_Inforce_&thestate.;
set AF_Inforce_&thestate.;
by Rating_St;
if first.Rating_St then counter=0;
counter+1;
myloop:
do i=2 to counter;
P2(i)=Policy(i);
P1(i)=Policy(i-1);
if P1(i)=P2(i) then flag='A';
else flag='B';
end;
return;
run;
The first thing you need to learn coming from MATLAB or a similar language is that SAS is different. In particular, the DATA step is its own DO loop, looping over records.
Second, it's a bit complicated to access data accross rows. However, there are a few tricks.
Vasja showed you one (lag, which doesn't actually go back to a previous record, but sort of acts like it does). dif does much the same thing except that it also compares, so if your policy number had been numeric, Vasja's code could be rewritten as dif(policy)=0 instead of policy=lag(policy) (though this only works for numerics).
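For example, a sketch assuming a numeric policy number (which is not the case for character values like 26X0118907) might look like this:
data want;
    set have;
    by rating_st;
    /* dif() is policy minus its value on the previous row; compute it
       unconditionally so the underlying queue updates on every observation */
    policy_change = dif(policy);
    if not first.rating_st and policy_change = 0 then flag = 'A';
    else flag = 'B';
    drop policy_change;
run;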
A better trick in my opinion in your case is to use by group processing. Normally this works with sorted fields, but here it doesn't matter if it's sorted: you just want to know if two consecutive rows are identical, right?
data want;
set have;
by rating_st policy notsorted;
if first.policy and last.policy then recflag='A';
else if first.rating_st then recflag='A';
else recflag='B';
run;
I don't know that I understand your rules entirely, but they're probably going to be some form of this. I put the two possibilities there; you might just want the second one (i.e., you don't care whether it's singular or just the first). The first would flag only singular policies.
Try looking at the LAG function (it "remembers" the values of a variable in a queue).
Your code should go like this:
data AF_Inforce_&thestate.;
set AF_Inforce_&thestate.;
by Rating_St;
if first.Rating_St = 0 and Policy=LAG(Policy) then flag='A';
else flag='B';
run;

CONTRAST for CLASS variable with more than two levels in PROC GLM

Background: When we test the significance of a categorical variable that has been coded as dummy variables, we need to simultaneously test all dummy variables are 0. For example, if X takes on values of 0, 1, 2, 3 and 4, I would fit dummy variables for levels 1-4 (assuming I want 0 to be baseline), then want to simultaneously test B1=B2=B3=B4=0.
If this is the only variable in my data set, I can use the overall F-statistic to achieve this. However, if I have other covariates, the overall F-test doesn't work.
In Stata, for example, this is (very, very) simply carried out by the testparm command as:
testparm i.x (after fitting the desired regression model), where the i. prefix tells Stata X is a categorical data to be treated as dummy variables.
Question/issue: I'm wondering how I can do this in SAS with a CONTRAST (or ESTIMATE?) statement while fitting a regression model with PROC GLM. Since I have scoured the internet and haven't found what I'm looking for, I'm guessing I'm missing something very obvious. However, all of the examples I've seen are NOT for categorical (class) variables, but rather two separate (say continuous) variables. The contrast statement in that case would simply be something like
CONTRAST 'Contrast1' y 1 z 1;
Otherwise, they're for calculating hypotheses like H_0: B1-B2=0.
I feel like I need to break the hypothesis down into smaller pieces and determine the set that defines the whole relationship, but I'm not doing it correctly. For example, for B1=B2=B3=B4=0, I thought I might say B1=B2=B3=-B4, then define (1) B1=-B4, (2) B2=-B4 and (3) B2=B3. I was trying to code this as a CONTRAST statement as (say X is in descending order in the data set: 4-0):
CONTRAST 'Contrast' x -1 0 0 1 0
x -1 0 1 0 0
x 0 1 1 0 0;
I know this is not correct, and I tried many, many variations and whatever random logic I could come up with. My problem is I have relatively novice-level knowledge of CONTRAST (and unfortunately have not found great documentation to help with this) and also of how this hypothesis test should really be formulated for the sake of estimation (do I try to split it up into pieces as I did above, or...?).
As I noted above, you can actually get SAS to do this for you with PROC GENMOD, using the CLASS statement and a TYPE3 specification.
proc genmod data=input;
class classvar ;
model slope= classvar othervar/ type3;
run;
quit;
In the example above, my class levels are in the classvar variable. The othervar is my other covariate.
At the end of the output, you will see a table labeled LR Statistics For Type 3 Analysis. The row for classvar is the likelihood ratio test that all of the class effects are 0.
Another option is PROC REG with a TEST statement (e.g., TEST x1=0, x2=0, x3=0, x4=0;). That doesn't answer my initial question about PROC GLM, but it is an option if PROC REG gets the job done for your type of model.
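For the PROC GLM route the question originally asked about, a multi-row CONTRAST statement can also produce the joint test. The sketch below uses hypothetical names (have, y, x, othervar) and assumes X has the five levels 0-4 in the default ascending CLASS order; note that the Type III F test for x in the standard PROC GLM output already gives the same joint test without any CONTRAST statement.
proc glm data=have;
    class x;
    model y = x othervar;
    /* four rows, separated by commas, form one joint 4-df F test
       that all level effects of x are equal (i.e. B1=B2=B3=B4=0) */
    contrast 'x overall' x 1 -1  0  0  0,
                         x 1  0 -1  0  0,
                         x 1  0  0 -1  0,
                         x 1  0  0  0 -1;
run;
quit;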