I have two lines of data,
Order
17/01/2016
01/02/2014
Basically I want to run logic like so:
data A.test_active;
set A.Weekly_Email_files_cleaned4;
length active :8.;
length inactive :8.;
if first.Order between '01Jan2014'd and '31Dec2015'd then active= 1;
if last.order between '01Jan2014'd and '31Dec2015'd then inactive= 1;
run;
The field "Order" is formatted as DDMMYY10 according to the file properties, but I keep getting this error:
ERROR 388-185: Expecting an arithmetic operator.
Can anyone help or suggest something different in the same vein?
In SAS, between is only valid in SQL contexts: either actual PROC SQL or WHERE statements, generally. It is not otherwise valid in SAS. You would use in (firstval:lastval) instead, if those values are integers (dates are). If they're not integers, you need to use if firstval le val le lastval or similar (you can also use ge/lt/gt/>/< or whatever you like, depending on the ordering of things).
Second, first.order and last.order are boolean values - 1 or 0, nothing else, that indicate that you are on a row that is the first row for a new value when sorted by that variable, or the last row similarly. You also must have a by statement by that variable if you're going to use them.
Third, your length statements are wrong; you're mixing up a few different things there, I think. Length statements for numerics aren't needed if you're using the default length of 8, but if you do like having them anyway, you need:
length active 8;
No : or ., both are used for different purposes.
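Putting those points together, a minimal sketch of the corrected step (a guess at the intent, since the actual active/inactive rule isn't fully clear; the dataset names are taken from the question):
data A.test_active;
set A.Weekly_Email_files_cleaned4;
length active inactive 8; /* numeric length: no colon, no trailing period */
if '01Jan2014'd le Order le '31Dec2015'd then active = 1;
/* dates are integers, so the IN-with-range form mentioned above also works:
   if Order in ('01Jan2014'd:'31Dec2015'd) then active = 1; */
run;
If you really do need first.Order/last.Order, remember the data must be sorted by Order and the step needs a by Order; statement, as noted above.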
ID first_order Order
alex 01/01/2013 23/01/2015
alex 01/01/2013 23/01/2015
alex 01/01/2013 03/04/2013
Basically, if an order exists after the first order that falls within a certain timeframe (within a year of the date of the first order), then the user is "active".
any ideas much appreciated
thanks
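Since every row already carries first_order, one possible sketch using intnx (the names have and want, and the assumption that the dates are numeric SAS dates, are mine, not from the post):
data want;
set have;
/* a later order within a year of the first order marks the row as "active" */
if Order > first_order and Order le intnx('year', first_order, 1, 'same') then active = 1;
run;
That flags qualifying rows; rolling it up to a single flag per ID could then be done with a max(active) per ID (PROC SQL or PROC MEANS), if that's what's needed.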
I am trying to find records which are not grouped the same way according to 2 different variables (all variables have character format).
My variables are appln_id (unique), earliest_filing_id (groupings) and docdb_family_id (groupings). The data set comprises around 25,000 different appln_id, but only 15446 different earliest_filing_id and 15755 different docdb_family_id. So you see there's a difference of ca. 300 records between these 2 groupings (potentially more, because groupings might also change).
What I would like to do is to see all cases which are not grouped the same way. Here is an example:
appln_id earliest_filing_id docdb_family_id
10137202 10137202 30449399
10272131 10137202 30449399
10272153 10137202 !!25768424!!
You can see that the last case differs and should be on my list that I hope to create.
I was trying to solve it with either PROC COMPARE, CALL SORTC or by + if...then coding, but have so far failed to come up with a good solution.
I haven't been using SAS for that long yet...
Your help is super appreciated!
Grazie
Annina
Sounds like you want to use BY group processing to assign a new group variable.
Make sure your data is sorted and then run something like this to create a new GROUPID variable.
data want ;
set have ;
by EARLIEST_FILING_ID DOCDB_FAMILY_ID ;
groupid + first.docdb_family_id ; /* sum statement: GROUPID goes up by 1 each time a new DOCDB_FAMILY_ID group starts */
run;
If my understanding is correct, you want to select unique docdb_family_id. Try this:
proc sql;
select * from yourfile group by docdb_family_id having count(*)=1;
quit;
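If the goal is instead to list every row belonging to a filing group that spans more than one family (my reading of the question, not part of the answer above), a variant of that query might be:
proc sql;
create table mismatches as
select * from yourfile
group by earliest_filing_id
having count(distinct docdb_family_id) > 1;
quit;
SAS will log a note about remerging summary statistics with the original data; that remerge is what keeps all rows of the offending groups rather than one row per group.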
Consider the fictional data below, which illustrates my problem; in reality the data contains thousands of rows.
Figure 1
Each individual is characterized by values attached to A, B, C, D, E. In figure 1, I show 3 individuals for which some characteristics are missing. Do you have any idea how I can get the completed table shown in figure 2?
Figure 2
With an ID in figure 1 I could have used the carryforward command to fill in the values. But since each individual has a different number of rows, I don't know how to create the ID.
Edit: All individuals share the characteristic "A".
Edit: the existing order of observations is informative.
To detect a change of individual (id), the idea is to check on each row whether the preceding value of char is >= the current one.
This only works if your data are ordered, but that seems to be a given in your data.
gen id = 1 if (char[_n-1] >= char[_n]) | _n == 1   // flag the first row of each individual
replace id = sum(id) if id == 1                    // running sum turns the flags into 1, 2, 3, ...
replace id = id[_n-1] if missing(id)               // carry the id down to that individual's other rows
fillin id char                                     // create every id/char combination, filling the gaps
drop _fillin
If an individual has only the characteristics A and C and another individual has only the characteristics D and E, this won't work, but such a case seems impossible to detect with your data.
I’m pretty new to do loops in SAS, and I know that I am trying to make this loop work like a MATLAB script. I haven’t found many helpful tips online, as most of the do-loop examples are just for calculations, not for actually checking whether the row before the current one has the same value.
Here is my issue that I need to solve:
I want to look at each policy number below and see if the one before it is the same; if it is, I want to flag it.
Policy
26X0118907
26X0375309
26X0375309
26X0527509
I would consider i=1 to be the first policy(26X0118907) and i=2 to be the second policy (26X0375309).
In this case, according to the code below (which doesn't work), this increment would be flagged as ‘B’. Do you know how to properly code a situation like this?
data AF_Inforce_&thestate.;
set AF_Inforce_&thestate.;
by Rating_St;
if first.Rating_St then counter=0;
counter+1;
myloop:
do i=2 to counter;
P2(i)=Policy(i);
P1(i)=Policy(i-1);
if P1(i)=P2(i) then flag='A';
else flag='B';
end;
return;
run;
The first thing you need to learn coming from MATLAB or a similar language is that SAS is different. In particular, the DATA step is its own DO loop, looping over records.
Second, it's a bit complicated to access data across rows. However, there are a few tricks.
Vasja showed you one (lag, which doesn't actually go to a previous record, but sort of acts like it does). dif does the same thing except that it returns the difference, so if your policy number had been numeric, Vasja's code could be rewritten as dif(policy)=0 instead of policy=lag(policy) (though this only works for numerics).
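For illustration only, a minimal sketch of that dif() variant, assuming a numeric policy key (the names have and polnum are hypothetical):
data flagged;
set have;
/* dif(polnum) = polnum - lag(polnum); zero means the previous row had the same value */
if dif(polnum) = 0 then flag = 'A';
else flag = 'B';
run;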
A better trick in my opinion in your case is to use by group processing. Normally this works with sorted fields, but here it doesn't matter if it's sorted: you just want to know if two consecutive rows are identical, right?
data want;
set have;
by rating_st policy notsorted;
if first.policy and last.policy then recflag='A';
else if first.rating_st then recflag='A';
else recflag='B';
run;
I don't know that I understand your rules entirely, but they're probably going to be some form of this. I put the two possibilities there; you might just want the second one (i.e., you don't care whether it's singular or just the first). The first would flag only singular policies.
Try looking at the LAG function (it "remembers" the values of a variable in a queue).
Your code should go like this:
data AF_Inforce_&thestate.;
set AF_Inforce_&thestate.;
by Rating_St;
if first.Rating_St = 0 and Policy=LAG(Policy) then flag='A';
else flag='B';
run;
totalSUPPLY= sum(of supply1-supply485);
I've got this simple calculation to make (in SAS) from a table that I've transposed (hence the variable names). I have to do this several times, and the number of supply variables is not the same for each calculation, i.e. in the above example it's 485, but later in my analysis it's 350.
My question: is there a way to 'wildcard' the number of 'supply' columns? Basically, I want something like this (but it doesn't work): totalSUPPLY= sum(of supply1-supply%);
Also: if there is an easier way to do the same, I'm open to (and would actually prefer) that.
Thanks everyone!
data yoursummary;
set yourdata; /*dataset containing supply1-supply485*/
array supplies{*} supply:;
totalSUPPLY = sum(of supplies{*});
run;
N.B. using a : wildcard like this will only pick up matching variables that are present in the PDV at the point when you create the array, so the array definition has to come after the set statement. Also, it only works for variables with a common prefix, not those with a common suffix.
As Joe has pointed out, the following more concise code also works:
data yoursummary;
set yourdata; /*dataset containing supply1-supply485*/
totalSUPPLY = sum(of supply:); /* the : wildcard uses the variable-name prefix, not the array name */
run;
Of course, if you declare an array it's then easier to do related things like checking how many variables are being added together, or looping through the variables in the array and applying the same logic to each one in turn.
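For instance, a small sketch of that kind of follow-up logic, reusing the names from the answer above (the zero-fill rule is just an example):
data yoursummary;
set yourdata; /* dataset containing supply1-supply485 */
array supplies{*} supply: ;
n_supply = dim(supplies); /* how many supply variables the wildcard picked up */
do i = 1 to dim(supplies); /* apply the same rule to each variable in turn */
if missing(supplies{i}) then supplies{i} = 0;
end;
totalSUPPLY = sum(of supplies{*});
drop i;
run;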
I'm afraid I'm running across the following:
Method 1:
proc sql;
create table as
...
compged(a.plan_id, b.plan_id,&maxscore.,'iL') as gedscore
from view_a a, view_b b
where a.state = b.state and calculated gedscore < &maxscore.
order by calculated gedscore;
This works, it's all fine and dandy, but I would like to adjust my results slightly with compcost. So I adopt Method 2:
proc sql;
create view tempview as select
...
from view_a a, view_b b
where a.state = b.state;
quit;
data modified_gedscore;
set tempview;
if _N_ = 1 then call compcost('delete=',10,'truncate=',10);
gedscore = compged(el_plan, clms_plan,&maxscore.,'iL');
if gedscore < &maxscore.;
run;
There's a bit more to it, but I've tried to isolate the relevant bits. I have tried to decrease the cost of the delete and truncate operations (which makes sense given the data I'm working with and what I'm trying to accomplish). My expectation was that, with delete and truncate operations having a lower cost, more observations would have a gedscore < &maxscore. However, I'm seeing the opposite: the call compcost is actually dramatically decreasing the number of observations I get. Do I have a basic misunderstanding of how call compcost works? If the above is incorrect, how would I adjust compged so that deletion of characters is more likely to fall under the maxscore threshold?
Edit: Also, I understand that the different structuring of the two methods would raise the possibility of something other than call compcost causing the unexpected results, but if I simply comment out the call compcost line I get results equivalent to that in Method 1. So, nope.
Edit2: sample data. First observation is equivalent (0). Second yields higher gedscore under method 2 than method 1, even though the compcost of delete and truncate has been lowered, with no other changes.
data sample_data;
input state1 $ plan1 $ plan2 $;
datalines;
ID DENTAL DENTAL
GA GBHC GBCH
;
Edit3: I think I may have found the problem. It appears that the default compged costs are different from the default compcost costs (per their respective documentation pages). When compcost is called, all operations not specified are set to the compcost defaults, which are usually higher. If anybody feels like confirming, feel free.
Thanks for your help
The issue is that COMPGED is not using the SWAP cost, but instead only using DELETE and INSERT (the latter of which costs 100). That's because of how CALL COMPCOST works; for some reason (that makes little sense to me), CALL COMPCOST's default values are not equal to COMPGED's default values - and it inserts a default value into every other operation that you do not specify.
In order to make this work, it looks like you'll have to specify a value for everything that you want it to use, in particular, APPEND, BLANK, PUNCTUATION, SINGLE, SWAP, and TRUNCATE (the latter of which you do specify already). From the doc, as of 9.2, the defaults were 50,10,30,20,20,10 for COMPGED for those.
In your example:
data sample_data;
input state1 $ plan1 $ plan2 $;
call compcost('del=',10,'truncate=',10,'swap=',20);
compged_1 = compged(plan1,plan2,'il');
put compged_1=;
datalines;
ID DENTAL DENTAL
GA GBHC GBCH
;
run;
Now returns 20 instead of 110.
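Based on that, a sketch of what a fuller CALL COMPCOST in the original Method 2 step might look like, spelling out the 9.2 COMPGED defaults quoted above alongside the lowered delete/truncate costs (other operations such as insert or replace may need the same treatment if they matter for your data):
if _N_ = 1 then call compcost('delete=',10, 'truncate=',10, /* the intentionally lowered costs */
'append=',50, 'blank=',10, 'punctuation=',30, /* remaining values: the 9.2 COMPGED defaults */
'single=',20, 'swap=',20); /* as quoted in the answer above */
gedscore = compged(el_plan, clms_plan, &maxscore., 'iL');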