Comparing occurrences on two observations in SAS

Hi, I have a data treatment problem in SAS. I have the transaction history for every customer, and I have also created a Customer_Tranx_Number. In addition, I have flagged every transaction with a 1/0 event flag.
Now I want to find the observation where the event flag changes from 1 to 0, i.e. flag the transaction that shows the first "0" after a "1". I also have to create this flag for every customer separately.
How can I code this in SAS?
I tried to illustrate the problem at the following link; thank you very much in advance for all your help.
http://zeybekomer.blogspot.com.tr/2015/10/blog-post_12.html
Regards

DATA NEW;
    SET YOURS;
    /* same customer as the prior record, and the flag value has changed */
    IF LAG1(CUST_ID) = CUST_ID AND LAG1(FLAG_1) NE FLAG_1 THEN NEW_FLAG = "FLAG=1";
RUN;
That code first checks whether the record belongs to the same customer as the prior record, and then whether the current value of FLAG_1 differs from the prior record's value.
You can get more specific if needed by adding further Boolean logic, such as: when the prior value of FLAG_1 is 1 and the current value is 0, then set the flag, etc.
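For example, a minimal, untested sketch of that 1-then-0 variant, assuming the data are sorted by CUST_ID and transaction number and that FLAG_1 is numeric:
DATA NEW;
    SET YOURS;
    /* compute the lags unconditionally so the LAG queues update on every row */
    LAG_CUST = LAG1(CUST_ID);
    LAG_FLAG = LAG1(FLAG_1);
    /* flag a 0 that immediately follows a 1 within the same customer */
    IF LAG_CUST = CUST_ID AND LAG_FLAG = 1 AND FLAG_1 = 0 THEN NEW_FLAG = 1;
    ELSE NEW_FLAG = 0;
    DROP LAG_CUST LAG_FLAG;
RUN;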

Related

PDI - Check data types of field

I'm trying to create a transformation that reads CSV files and checks the data type of each field in the CSV.
For example: field A should be a string(1) character and field B should be an integer/number.
What I want is to check/validate: if A is not string(1), set Status = Not Valid, and likewise if B is not an integer/number. Then every file with status Not Valid should be moved to an error folder.
I know I can use the Data Validator to do the check, but how do I move the file based on that status? I can't find any step to do it.
You can read the files in a loop and add steps as below:
after the data validation, filter the rows with a negative result (not matched) -> add an Add constants step that sets error = 1 -> add a Set Variables step that fills the ERROR variable from the error field, with a default value of 0.
After the transformation finishes, add a Simple evaluation entry in the parent job to check the value of the ERROR variable.
If it has the value 1, then move the file, else ....
I hope this can help.
You can do the same as in this question. Once the file is read, use a Group by to get one flag per file. However, this time you cannot do it in one transformation; you should use a job.
Your use case is in the samples shipped with your PDI distribution, in the folder your-PDI/samples/jobs/run_all. Open the Run all sample transformations.kjb and replace the Filter 2 of the Get Files - Get all transformations.ktr with your own logic, which includes a Group by so that you get one status per file and not one status per row.
In case you wonder why you need such complex logic for such a task, remember that PDI starts all the steps of a transformation at the same time. That is its great power, but it means you cannot know whether the file has to be moved until every row has been processed.
Alternatively, you have the quick and dirty solution from your similar question: change the Filter rows step to a type check, and the final Synchronize after merge to a Process files / Move.
And a final piece of advice: instead of checking the type with a Data Validator, which is a good solution in itself, you may use a JavaScript step as suggested there. It is more flexible if you need to maintain it in the long run.

Stata code to conditionally sum values based on a group rank

I'm trying to write code for a fairly huge dataset (3M observations) that has been segregated into smaller groups (ID). For each observation (described in the table below), I want to create a cumulative sum of the variable "Value" over all observations ranked below mine, restricted to those lower-ranked observations whose Condition equals mine.
[table screenshot; the same data are reproduced as CSV in UPDATE 2 below]
I want to write this code without using loops, if there is a way to do so.
Could someone help me?
Thank you!
UPDATE:
I have pasted the equation for the output variable below.
UPDATE 2:
The CSV format of the above table is:
ID,Rank,Condition,Value,Expected output
1,1,30,10,0
1,2,40,20,0
1,3,20,30,0
1,4,30,40,10
1,5,40,50,20
1,6,20,60,30
1,7,30,70,80
2,1,40,80,0
2,2,20,90,0
2,3,30,100,0
2,4,40,110,80
2,5,20,120,90
2,6,30,130,100
2,7,40,140,190
2,8,20,150,210
2,9,30,160,230
Equation: Expected output_i = \sum_{j \,:\, ID_j = ID_i,\; Condition_j = Condition_i,\; Rank_j < Rank_i} Value_j
If I understand correctly, for each combination of ID and Condition, you want to calculate a running sum, ordered by Rank, of the variable Value, excluding the current observation. If that is indeed your goal, the following untested code might set you on the path to a solution
sort ID Condition Rank
// be sure there is a single observation for each combination
isid ID Condition Rank
// generate the running sum
by ID Condition (Rank): generate output = sum(Value)
// subtract out the current observation
replace output = output - Value
// return to the original order
sort ID Rank
As I said, this is untested, because my copy of Stata cannot read pictures of data. If your testing shows that it is imperfect and you cannot resolve the problem yourself, providing your sample data in a usable format will increase the likelihood someone will be able to help.
Added in edit: Corrected the isid command.
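A quick way to test this is to type in a few rows of the sample from UPDATE 2 with input; a minimal sketch using the first ID group only:
clear
input ID Rank Condition Value
1 1 30 10
1 2 40 20
1 3 20 30
1 4 30 40
1 5 40 50
1 6 20 60
1 7 30 70
end
sort ID Condition Rank
isid ID Condition Rank
by ID Condition (Rank): generate output = sum(Value)
replace output = output - Value
sort ID Rank
list, sepby(ID)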

first and last order SAS

I have two lines of data,
Order
17/01/2016
01/02/2014
Basically I want to run logic like so:
data A.test_active;
set A.Weekly_Email_files_cleaned4;
length active :8.;
length inactive :8.;
if first.Order between '01Jan2014'd and '31Dec2015'd then active= 1;
if last.order between '01Jan2014'd and '31Dec2015'd then inactive= 1;
run;
The field "Order" has the DDMMYY10. format (I checked the file properties), but I keep getting this error:
ERROR 388-185: Expecting an arithmetic operator.
Can anyone help or suggest something different in the same vein?
In SAS, between is only valid in SQL contexts: either actual PROC SQL, or WHERE statements, generally. It is not otherwise valid in SAS. You would use in (firstval:lastval) instead, if those values are integers (dates are). If they're not integers, you need to use if firstval le val le lastval or similar (can also use ge/lt/gt/>/< or whatever you like, depending on the ordering of things).
Second, first.order and last.order are boolean values - 1 or 0, nothing else, that indicate that you are on a row that is the first row for a new value when sorted by that variable, or the last row similarly. You also must have a by statement by that variable if you're going to use them.
Third, your length statements are wrong; you're confusing some three different things here, I think. Length statements for numerics aren't needed if you're using default length 8, and if you do like having them anyway, you need:
length active 8;
No : or ., both are used for different purposes.
ID first_order Order
alex 01/01/2013 23/01/2015
alex 01/01/2013 23/01/2015
alex 01/01/2013 03/04/2013
Basically, if an order exists after the first order that is within a certain timeframe (within a year of the date of the first order), then the user is "active".
Any ideas much appreciated.
Thanks.
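Pulling the corrections above together, a hedged and untested sketch of what the step might look like, assuming the data contain the ID variable shown in the example, and that the flags are wanted per customer:
proc sort data=A.Weekly_Email_files_cleaned4 out=work.orders_sorted;
    by ID Order;
run;
data A.test_active;
    set work.orders_sorted;
    by ID;                            /* required for first./last. to be defined */
    length active inactive 8;         /* plain numeric lengths, no : or . */
    /* range check written with le, since between is not valid here */
    if first.ID and '01JAN2014'd le Order le '31DEC2015'd then active = 1;
    if last.ID  and '01JAN2014'd le Order le '31DEC2015'd then inactive = 1;
run;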

How do I delete observations with no data in Stata?

I have data with IDs which may or may not have all values present. I want to delete ONLY the observations with no data in them; if there are observations with even one value, I want to retain them. Eg, if my data set is:
ID val1 val2 val3 val4
1 23 . 24 75
2 . . . .
3 45 45 70 9
I want to drop only ID 2 as it is the only one with no data -- just an ID.
I have tried Statalist and Google but couldn't find anything relevant.
This will also work with string variables, as long as the missing strings are empty:
ds id*, not
egen num_nonmiss = rownonmiss(`r(varlist)'), strok
drop if num_nonmiss == 0
This gets a list of variables that are not the id and drops any observations that only have the id.
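As a quick check, a minimal sketch that types in the three-row example from the question and applies the same steps (note that ds is case sensitive, so the pattern must match your actual id variable's name):
clear
input ID val1 val2 val3 val4
1 23  .  24 75
2  .  .   .  .
3 45 45  70  9
end
ds ID*, not
egen num_nonmiss = rownonmiss(`r(varlist)'), strok
drop if num_nonmiss == 0
list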
Brian Albert Monroe is quite correct that anyone using dropmiss (SJ) needs to install it first. As there is interest in varying ways of solving this problem, I will add another.
foreach v of var val* {
    qui count if missing(`v')
    if r(N) == _N local todrop `todrop' `v'
}
if "`todrop'" != "" drop `todrop'
Although it should be a comment under Brian's answer, I will add a comment here, as (a) this format is more suited to showing code and (b) the comment follows from my code above. I agree that unab is a useful command and have often commended it in public. Here, however, it is unnecessary, as Brian's loops could easily start something like
foreach v of var * {
UPDATE September 2015: See http://www.statalist.org/forums/forum/general-stata-discussion/general/1308777-missings-now-available-from-ssc-new-program-for-managing-missings for information on missings, considered by the author of both to be an improvement on dropmiss. The syntax to drop observations if and only if all values are missing is missings dropobs.
Just another way to do it which helps you discover how flexible local macros are without installing anything extra to Stata. I rarely see code using locals storing commands or logical conditions, though it is often very useful.
// Loop through all variables to build a useful local
foreach vname of varlist _all {
    // We don't want to include ID in our drop condition, so don't execute the remaining code if our loop is currently on ID
    if "`vname'" == "ID" continue
    // This local stores all the variable names except 'ID' and a logical condition that checks if it is missing
    local dropper "`dropper' `vname' ==. &"
}
// Let's see all the observations which have missing data for all variables except for ID
// The '1==1' bit is there to deal with the trailing '&' in the `dropper' local; it is of course true.
list if `dropper' 1==1
// Now let's drop those observations
drop if `dropper' 1==1
// Now check they're all gone
list if `dropper' 1==1
// They are.
Now dropmiss may be convenient once you've downloaded and installed it, but if you are writing a do file to be used by someone else, unless they also have dropmiss installed, your code won't work on their machine.
With this approach, if you remove the lines of comments and the two unnecessary list commands, this is a fairly sparse 5 lines of code which will run with Stata out of the box.

Stata command to add all choices, those made and those not made

UPDATE:
I solved the first part of the problem. I created unique ids for each observation:
gen id=_n
Then, I used
fillin id categ
which essentially created what I was looking for.
However, for the rest of the variables (except id and categ), almost all observations are missing. Now, I need your help to duplicate the rest of the variables instead of having them missing.
Just as an example, each observation is associated with a particular week, and after fillin most of the week values are missing. Another dummy variable indicates whether a purchase was made at a drug store or a grocery store; most of those values are missing too.
Thanks!
ORIGINAL MESSAGE:
Need your help in Stata!
Each observation in my database is a 1-unit purchase of a beer product made by a customer. These product purchases are categorized into 8 general categories, so that the variable "categ" has values from 1 to 8 (1=import, 2=craft, 3=premium, 4=light, etc.).
For my multinomial logit model, I need to observe all categories purchased or not purchased by the customer in each observation.
Assume, this is my initial dataset:
customer id   beer category   units purchased
     1               1                1
     2               3                1
     3               2                1
This is what I am looking for:
customer id   beer category   units purchased
     1               1                1
     1               2                0
     1               3                0
     2               1                0
     2               2                0
     2               3                1
     3               1                0
     3               2                1
     3               3                0
Currently, my dataset is 600,000 obs. After this procedure, I should have 600,000*8=4,800,000 obs.
When constructing this code, it is necessary that all other variables in the dataset are duplicated according to the associated category of beer.
I assume that "fillin" and less likely "expand" might work.
Your help will be tremendously appreciated.
Thanks!
This is an old question, but I'll post a possible answer in case someone else is having this problem.
In this case, you could generate variables for every option of your "choice variable", and after that, apply the reshape long command:
tab beercategory, gen(b)
reshape long b , i(customerid) j(newvarname)
Greetings
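Regarding the remaining issue in the UPDATE (the other variables being missing on the rows that fillin creates), here is a hedged sketch of one way to copy them over within each id; week and store_type are hypothetical stand-ins for your purchase-level variables, and units for the units-purchased variable:
* rows added by -fillin id categ- for unpurchased categories have _fillin == 1
foreach v of varlist week store_type {
    bysort id (_fillin): replace `v' = `v'[1] if _fillin == 1
}
* unpurchased categories get zero units
replace units = 0 if _fillin == 1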