Create Lag variable with three conditions - sas

-I need a lead variable based on 3 conditions. IF variable RoaDLM has a number and IF the Co_ID is the same as the lag(co_id) and IF CEO = lag(ceo), I need a lead variable: Lead1
-i sort descending to create lag variable
-Every thing else should be '.'
-here is my code:
data RoaReg;
set RoaReg;
by CO_ID descending fyear;
if RoaDlm ne 0 and Co_ID = lag(CO_ID) and ceo=ceo then
Lead1 = lag(ROA);
else if RoaDlm= 0 then
Lead1='.';
run;
-Anyway, this does not work. Thanks!

Theres a couple of issues with your code.
Do not use the same data set name in the SET and DATA statements. This is a recipe for errors that are difficult to debug.
Lag() cannot be calculated conditionally, use it always and set to missing when necessary.
data RoaReg2;
set RoaReg;
by CO_ID descending fyear;
Lead1 = lag(ROA);
if RoaDlm= 0 then call missing (lead1);
run;
This is the correct version of your code, or my best guess. Providing sample data would help for sure.

Based on what I understood, you need a lead variable based on few conditions - two being lagged value of the variables.
You don't have a lead function in SAS, as per my knowledge. You can use proc expand for that purpose. And, you did not mention about the variable for which you want a lead - so, I am assuming it to be a variable named ROA.
So, here is my best guess/interpretation of what you want.
data RoaReg_lead;
merge RoaReg RoaReg(keep=ROA rename=(ROA=LeadROA) firstobs=2); /*merged the same table with only the ROA variable, and read the values from 2nd observation | can't use by variables in order to do so*/
Lag_co_id=lag(co_id); /*creating lagged values*/
Lag_ceo=lag(ceo);
/*conditions*/
if (RoaDLM ne . and RoaDLM>0) and co_id=Lag_co_id and ceo=Lag_ceo then
Lead1=LeadROA;
drop Lag_co_id Lag_ceo LeadROA; /*You can keep the vars to do a manual check*/
run;
Otherwise, providing a sample table of your data (have and want) would be very helpful.

Related

How to choose indexed assignment variable dynamically in SAS?

I am trying to build a custom transformation in SAS DI. This transformation will "act" on columns in an input data set, producing the desired output. For simplicity let's assume the transformation will use input_col1 to compute output_col1, input_col2 to compute output_col2, and so on up to some specified number of columns to act on (let's say 2).
In the Code Options section of the custom transformation users are able to specify (via prompts) the names of the columns to be acted on; for example, a user could specify that input_col1 should refer to the column named "order_datetime" in the input dataset, and either make a similar specification for input_col2 or else leave that prompt blank.
Here is the code I am using to generate the output for the custom transformation:
data cust_trans;
set &_INPUT0;
i=1;
do while(i<3);
call symputx('index',i);
result = myfunc("&&input_col&index");
output_col&index = result; /*what is proper syntax here?*/
i = i+1;
end;
run;
Here myfunc refers to a custom function I made using proc fcmp which works fine.
The custom transformation works fine if I do not try to take into account the variable number of input columns to act on (i.e. if I use "&&input_col&i" instead of "&&input_col&index" and just use the column result on the output table).
However, I'm having two issues with trying to make the approach more dynamic:
I get the following warning on the line containing
result = myfunc("&&input_col&index"):
WARNING: Apparent symbolic reference INDEX not resolved.
I do not know how to have the assignment to the desired output column happen dynamically; i.e., depending on the iteration of the do loop I'd like to assign the output value to the corresponding output column.
I feel confident that the solution to this must be well known amongst experts, but I cannot find anything explaining how to do this.
Any help is greatly appreciated!
You can't use macro variables that depend on data variables, in this manner. Macro variables are resolved at compile time, not at run time.
So you either have to
%do i = 1 %to .. ;
which is fine if you're in a macro (it won't work outside of an actual macro), or you need to use an array.
data cust_trans;
set &_INPUT0;
array in[2] &input_col1 &input_col2; *or however you determine the input columns;
array output_col[2]; *automatically names the results;
do i = 1 to dim(in);
result = myfunc(in[i]); *You quote the input - I cannot see what your function is doing, but it is probably wrong to do so;
output_col[i] = result; /*what is proper syntax here?*/
end;
run;
That's the way you'd normally do that. I don't know what myfunc does, and I also don't know why you quote "&&input_col&index." when you pass it to it, but that would be a strange way to operate unless you want the name of the input column as text (and don't want to know what data is in that variable). If you do, then pass vname(in[i]) which passes the name of the variable as a character.

SAS compare character records according to 2 different variable groupings

I am trying to find records which do are not grouped similarly according to 2 different variables (all variables have character format).
My variables are appln_id (unique) earliest_filing_id (groupings) docdb_family_id (groupings). The data set comprises around 25,000 different appln_id, but only 15446 different earliest_filing_id and 15755 docdb_family_id. Now you see that there's a difference of ca. 300 records among these 2 groups (potenially more because groupings might also change).
Now what I would like to do is the see all cases, which are not similarly grouped. Here an example:
appln_id earliest_filing_id docdb_family_id
10137202 10137202 30449399
10272131 10137202 30449399
10272153 10137202 !!25768424!!
You can see that the last case differs and should be on my list that I hope to create.
I was trying to solve it with either a Proc compare, a Call sortc or a by+if...then coding but failed so far to come up with a good solution.
I am not using SAS for that long yet...
Your help is super appreciated!
Grazie
Annina
Sounds like you want to use BY group processing to assign a new group variable.
Make sure your data is sorted and then run something like this to create a new GROUPID variable.
data want ;
set have ;
by EARLIEST_FILING_ID DOCDB_FAMILY_ID ;
groupid + first.docdb_family_id ;
run;
If my understanding is correct, you want to select unique docdb_family_id. Try this:
proc sql;
select * from yourfile group by docdb_family_id having count(*)=1;
quit;

first and last order SAS

I have two lines of data,
Order
17/01/2016
01/02/2014
Basically I want to run a logic like so;
data A.test_active;
set A.Weekly_Email_files_cleaned4;
length active :8.;
length inactive :8.;
if first.Order between '01Jan2014'd and '31Dec2015'd then active= 1;
if last.order between '01Jan2014'd and '31Dec2015'd then inactive= 1;
run;
the field "Order" is formatted by DDMMYY10 when I checked the file properties, but I keep getting this error
ERROR 388-185: Expecting an arithmetic operator.
Can anyone help or suggest something different in the same vain?
In SAS, between is only valid in SQL contexts: either actual PROC SQL, or WHERE statements, generally. It is not otherwise valid in SAS. You would use in (firstval:lastval) instead, if those values are integers (dates are). If they're not integers, you need to use if firstval le val le lastval or similar (can also use ge/lt/gt/>/< or whatever you like, depending on the ordering of things).
Second, first.order and last.order are boolean values - 1 or 0, nothing else, that indicate that you are on a row that is the first row for a new value when sorted by that variable, or the last row similarly. You also must have a by statement by that variable if you're going to use them.
Third, your length statements are wrong; you're confusing some three different things here, I think. Length statements for numerics aren't needed if you're using default length 8, and if you do like having them anyway, you need:
length active 8;
No : or ., both are used for different purposes.
ID first_order Order
alex 01/01/2013 23/01/2015
alex 01/01/2013 23/01/2015
alex 01/01/2013 03/04/2013
basically if an order exists after the first order that is within a certain timeframe (within a year of the date of the first order) then the user is "active"
any ideas much appreciated
thanks

Wildcard in variable list

totalSUPPLY= sum(of supply1-supply485);
Ive got this simple calculation to make (in SAS) from a table that Ive transposed (hence the variable names). I have to do this several times, and the the number of supply variables is not the same for each calculation. I.e. in the above example its 485, but I do it later in my analysis and its 350.
My question: Is there a way to 'wildcard' the number of 'supply' columns. Basically, I want something like this (but this doesnt work): totalSUPPLY= sum(of supply1-supply%);
Also: If there is an easier way do the same Im open (and would actually prefer) that.
Thanks everyone!
data yoursummary;
set yourdata; /*dataset containing supply1-supply485*/
array supplies{*} supply:;
totalSUPPLY = sum(of supplies{*});
run;
N.B. using a : wildcard like this will only pick up matching variables that are present in the PDV at the point when you create the array, so the array definition has to come after the set statement. Also, it only works for variables with a common prefix, not those with a common suffix.
As Joe has pointed out, the following more concise code also works:
data yoursummary;
set yourdata; /*dataset containing supply1-supply485*/
totalSUPPLY = sum(of supplies:);
run;
Of course, if you declare an array it's then easier to do related things like checking how many variables are being added together, or looping through the variables in the array and applying the same logic to each one in turn.

How do I delete observations with no data in Stata?

I have data with IDs which may or may not have all values present. I want to delete ONLY the observations with no data in them; if there are observations with even one value, I want to retain them. Eg, if my data set is:
ID val1 val2 val3 val4
1 23 . 24 75
2 . . . .
3 45 45 70 9
I want to drop only ID 2 as it is the only one with no data -- just an ID.
I have tried Statalist and Google but couldn't find anything relevant.
This will also work with strings as long as they are empty:
ds id*, not
egen num_nonmiss = rownonmiss(`r(varlist)'), strok
drop if num_nonmiss == 0
This gets a list of variables that are not the id and drops any observations that only have the id.
Brian Albert Monroe is quite correct that anyone using dropmiss (SJ) needs to install it first. As there is interest in varying ways of solving this problem, I will add another.
foreach v of var val* {
qui count if missing(`v')
if r(N) == _N local todrop `todrop' `v'
}
if "`todrop'" != "" drop `todrop'
Although it should be a comment under Brian's answer, I will add here a comment here as (a) this format is more suited for showing code (b) the comment follows from my code above. I agree that unab is a useful command and have often commended it in public. Here, however, it is unnecessary as Brian's loops could easily start something like
foreach v of var * {
UPDATE September 2015: See http://www.statalist.org/forums/forum/general-stata-discussion/general/1308777-missings-now-available-from-ssc-new-program-for-managing-missings for information on missings, considered by the author of both to be an improvement on dropmiss. The syntax to drop observations if and only if all values are missing is missings dropobs.
Just another way to do it which helps you discover how flexible local macros are without installing anything extra to Stata. I rarely see code using locals storing commands or logical conditions, though it is often very useful.
// Loop through all variables to build a useful local
foreach vname of varlist _all {
// We don't want to include ID in our drop condition, so don't execute the remaining code if our loop is currently on ID
if "`vname'" == "ID" continue
// This local stores all the variable names except 'ID' and a logical condition that checks if it is missing
local dropper "`dropper' `vname' ==. &"
}
// Let's see all the observations which have missing data for all variables except for ID
// The '1==1' bit is a condition to deal with the last '&' in the `dropper' local, it is of course true.
list if `dropper' 1==1
// Now let's drop those variables
drop if `dropper' 1==1
// Now check they're all gone
list if `dropper' 1==1
// They are.
Now dropmiss may be convenient once you've downloaded and installed it, but if you are writing a do file to be used by someone else, unless they also have dropmiss installed, your code won't work on their machine.
With this approach, if you remove the lines of comments and the two unnecessary list commands, this is a fairly sparse 5 lines of code which will run with Stata out of the box.