Creating an ID based on factor and filling down with Stata - stata

Consider the fictional data to illustrate my problem, which contains in reality thousands of rows.
Figure 1
Each individual is characterized by values attached to A,B,C,D,E. In figure1, I show 3 individuals for which some characteristics are missing. Do you have any idea how can I get the following completed table (figure 2)?
Figure 2
With the ID in figure 1 I could have used the carryforward command to filling in the values. But since each individual has a different number of rows I don't know how to create the ID.
Edit: All individual share the characteristic "A".
Edit: the existing order of observations is informative.

To detect the change of id, the idea is to compare if the precedent value of char is >= in each rows.
This works only if your data are ordered, but it seems mandatory in your data.
gen id= 1 if (char[_n-1] >= char[_n]) | _n ==1
replace id = sum(id) if id==1
replace id = id[_n-1] if missing(id)
fillin id char
drop _fillin
If an individual as only the characteristics A and C and another individual as only the characteristics D and E, this won't work, but it seems impossible to detect with your data.

Related

How can one create a single table from multiple datasets?

I'm trying to create a descriptive table by treatment group. For my analysis, I have 3 different partitions of the data (because I'm running 3 separate analyses) from a complete data set, but I only have one statistic from each subset that I am trying to describe, so I think it'd look better in one complete table. At the end, I'd like an output that can convert to latex (as I'm using bookdown).
I've been using the compareGroups package to easily create each table individually. I know that there is an rbind function that allows to create a stacked table, but it won't let me combine them because the n of each separate data frame is different (due to missingness). For instance, I'm trying to study marriage in one of my analyses, and later divorce (which is a separate analysis), and so the n's of these two data frames differ, but the definition of treatment group is the same.
Ideally, I'd have two columns, one for the treatment group and one for the control group. There would be two rows, one that has age of first marriage, and the second row which would have length of that first marriage, and then the respective ns of the cells.
library(compareGroups)
d1 <- compareGroups(treat ~ time1mar,
data = nlsy.mar,
simplify=TRUE,
na.action=na.omit) %>% createTable(.,
type=1,
show.p.overall = FALSE)
d2 <- compareGroups(treat ~ time1div,
data = nlsy.div,
simplify=TRUE,
na.action=na.omit) %>% createTable(.,
type=1,
show.p.overall = FALSE)
d.tot <- rbind(`First Age at Marriage` = d1, `Length of First Marriage` = d2)
This is the error that I get:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 6626, 5057
Any suggestions?
The problem might be that you're using na.omit which delets the cases/rows with NAs from both of your datasets. Probably a different amount of cases get removed from each data set. But actually different numbers of row should only be a problem with cbind. However you might try to change the na.action option.
I'm just guessing. As said by joshpk without sample data is difficult to reproduce your problem.

Stata code to conditionally sum values based on a group rank

I'm trying to write a code for a fairly huge dataset (3m observations) which has been segregated into smaller groups (ID). For each observation (described in the table below), I want to create a cumulative sum of a variable "Value" for all observations ranked below me, subject to condition of the lower ranked observation equals mine.
[
I want to write this code without using loops, if there is a way to do so.
Could someone help me?
Thank you!
UPDATE:
I have pasted the equation for the output variable below.
UPDATE 2:
The CSV format of the above table is:
ID,Rank,Condition,Value,Expected output,,
1,1,30,10,0,,
1,2,40,20,0,,
1,3,20,30,0,,
1,4,30,40,10,,
1,5,40,50,20,,
1,6,20,60,30,,
1,7,30,70,80,,
2,1,40,80,0,,
2,2,20,90,0,,
2,3,30,100,0,,
2,4,40,110,80,,
2,5,20,120,90,,
2,6,30,130,100,,
2,7,40,140,190,,
2,8,20,150,210,,
2,9,30,160,230,,
Equation
If I understand correctly, for each combination of ID and Condition, you want to calculate a running sum, ordered by Rank, of the variable Value, excluding the current observation. If that is indeed your goal, the following untested code might set you on the path to a solution
sort ID Condition Rank
// be sure there is a single observation for each combination
isid ID Condition Rank
// generate the running sum
by ID Condition (Rank): generate output = sum(Value)
// subtract out the current observation
replace output = output - Value
// return to the original order
sort ID Rank
As I said, this is untested, because my copy of Stata cannot read pictures of data. If your testing shows that it is imperfect and you cannot resolve the problem yourself, providing your sample data in a usable format will increase the likelihood someone will be able to help.
Added in edit: Corrected the isid command.

SAS compare character records according to 2 different variable groupings

I am trying to find records which do are not grouped similarly according to 2 different variables (all variables have character format).
My variables are appln_id (unique) earliest_filing_id (groupings) docdb_family_id (groupings). The data set comprises around 25,000 different appln_id, but only 15446 different earliest_filing_id and 15755 docdb_family_id. Now you see that there's a difference of ca. 300 records among these 2 groups (potenially more because groupings might also change).
Now what I would like to do is the see all cases, which are not similarly grouped. Here an example:
appln_id earliest_filing_id docdb_family_id
10137202 10137202 30449399
10272131 10137202 30449399
10272153 10137202 !!25768424!!
You can see that the last case differs and should be on my list that I hope to create.
I was trying to solve it with either a Proc compare, a Call sortc or a by+if...then coding but failed so far to come up with a good solution.
I am not using SAS for that long yet...
Your help is super appreciated!
Grazie
Annina
Sounds like you want to use BY group processing to assign a new group variable.
Make sure your data is sorted and then run something like this to create a new GROUPID variable.
data want ;
set have ;
by EARLIEST_FILING_ID DOCDB_FAMILY_ID ;
groupid + first.docdb_family_id ;
run;
If my understanding is correct, you want to select unique docdb_family_id. Try this:
proc sql;
select * from yourfile group by docdb_family_id having count(*)=1;
quit;

first and last order SAS

I have two lines of data,
Order
17/01/2016
01/02/2014
Basically I want to run a logic like so;
data A.test_active;
set A.Weekly_Email_files_cleaned4;
length active :8.;
length inactive :8.;
if first.Order between '01Jan2014'd and '31Dec2015'd then active= 1;
if last.order between '01Jan2014'd and '31Dec2015'd then inactive= 1;
run;
the field "Order" is formatted by DDMMYY10 when I checked the file properties, but I keep getting this error
ERROR 388-185: Expecting an arithmetic operator.
Can anyone help or suggest something different in the same vain?
In SAS, between is only valid in SQL contexts: either actual PROC SQL, or WHERE statements, generally. It is not otherwise valid in SAS. You would use in (firstval:lastval) instead, if those values are integers (dates are). If they're not integers, you need to use if firstval le val le lastval or similar (can also use ge/lt/gt/>/< or whatever you like, depending on the ordering of things).
Second, first.order and last.order are boolean values - 1 or 0, nothing else, that indicate that you are on a row that is the first row for a new value when sorted by that variable, or the last row similarly. You also must have a by statement by that variable if you're going to use them.
Third, your length statements are wrong; you're confusing some three different things here, I think. Length statements for numerics aren't needed if you're using default length 8, and if you do like having them anyway, you need:
length active 8;
No : or ., both are used for different purposes.
ID first_order Order
alex 01/01/2013 23/01/2015
alex 01/01/2013 23/01/2015
alex 01/01/2013 03/04/2013
basically if an order exists after the first order that is within a certain timeframe (within a year of the date of the first order) then the user is "active"
any ideas much appreciated
thanks

Stata command to add all choices, those made and those not made

UPDATE:
I solved the first part of the problem. I created unique ids for each observation:
gen id=_n
Then, I used
fillin id categ
which essentially created what I was looking for.
However, for the rest of the variables (except id and categ), almost all observations are missing. Now, I need your help to duplicate the rest of the variables instead of having them missing.
Just as an example, each observation is associated with a particular week. I am missing most of them. Or another dummy variable indicates whether a purchase was made at a drug or grocery store. Most of them are missing too.
Thanks!
ORIGINAL MESSAGE:
Need your help in Stata!
Each observation in my database is a 1-unit purchase of a beer product made by a customer. These product purchases are categorized unto 8 general categories such that the variable "categ" has values from 1 to 8 (1=import, 2=craft, 3=premium, 4=light, etc).
For my multinomial logit model, I need to observe all categories purchased or not purchased by the customer in each observation.
Assume, this is my initial dataset:
customer id-------beer category-----units purchased
----------1------------------1--------------------- 1
----------2----------------- 3--------------------- 1
----------3 -----------------2 ---------------------1
This is what I am looking for:
customer id-------beer category-----units purchased
----------1------------------1--------------------- 1
----------1 -----------------2 ---------------------0
----------1----------------- 3--------------------- 0
----------2----------------- 1--------------------- 0
----------2----------------- 3--------------------- 1
----------2 -----------------3--------------------- 0
----------3----------------- 1--------------------- 0
----------3----------------- 2--------------------- 0
----------3 -----------------2 ---------------------1
Currently, my dataset is 600,000 obs. After this procedure, I should have 600,000*8=4,800,000 obs.
When constructing this code, it is necessary that all other variables in the dataset are duplicated according to the associated category of beer.
I assume that "fillin" and less likely "expand" might work.
You help will tremendously help.
Thanks!
This is an old question, but i'll post a possible answer if someone else is having this problem.
In this case, you could generate variables for every option of your "choice variable", and after that, apply the reshape long command:
tab beercategory, gen(b)
reshape long b , i(customerid) j(newvarname)
Greetings