I have a data set that looks like this:
A B
0 1
0 1
0 1
1 0
I want to create new variables A`t' and B`t' for t = 1, 2, 3 that give the values of A and B for the past 1, 2, and 3 periods. I tried the following code, but I get the error "A invalid name":
local status A B
foreach x of local status {
forvalues t=1/3 {
gen "`x'"`t'="`x'"[_n-`t'] if _n>`t'
}
}
And the outcome I would like to get is the following:
A B A1 A2 A3 B1 B2 B3
0 1 . . . . . .
1 0 0 . . 1 . .
0 1 1 0 . 0 1 .
1 0 0 1 0 1 0 1
This works:
clear
input A B
0 1
0 1
0 1
1 0
end
foreach x in A B {
    forval t = 1/3 {
        gen `x'`t' = `x'[_n-`t']
    }
}
Notes:
Putting two variable names into a local only to take them out again does no harm, but is pointless otherwise.
The double quotes are wrong in this context.
The if qualifier would do no harm but you get the same result without it.
Most crucially, experienced Stata users would not do this. The idea of values one previous, two previous, and so forth only makes sense if the observations are in time or another sequence order, in which case most analyses require an explicit time-like variable, say
gen t = _n
after which you can go
tsset t
and the lagged variables are then automatically available as L1.A L2.A L3.A and so forth.
If your real data are panel or longitudinal data then you need an identifier as well as a time variable.
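For concreteness, here is a minimal sketch of that panel setup; the identifier id and time variable t below are hypothetical names, not taken from the question:
clear
input id t A B
1 1 0 1
1 2 0 1
1 3 1 0
2 1 1 0
2 2 0 1
end
* declare the panel structure: identifier first, then the time variable
xtset id t
* lags are then available through the lag operator, computed within panel
generate A1 = L1.A
generate B2 = L2.B
list, sepby(id)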
Related
I have been quite confused about how to implement this in SAS. I am trying to create duplicate rows if the value 2 occurs more than once across the variables member1-member4. For example, if a row has the value 2 in member2, member3, and member4, then I will create 2 duplicate rows, since the initial row will serve for the first of these variables and the duplicate rows will be for member3 and member4. On the duplicate row for member3, for example, member2 and member4 will be set to missing if their value is equal to 2. Basically, the value 2 can only occur once per row. Let's assume sa1 to sa4 correspond to other variables for member1 to member4 respectively. When we create a duplicate row for each member, the other sa variables should be set to missing if they have a value of 1. For example, if the duplicate row is for member3, then values of sa1, sa2, and sa4 that equal 1 should be set to missing. There are other variables in the dataset that will have the same values for all duplicate rows as for the initial rows. Duplicate rows will also have a suffix on the ID to indicate the parent row.
This is an example of the data I have
id member1 member2 member3 member4 sa1 sa2 sa3 sa4
1 0 2 2 0 0 1 1 0
2 2 2 0 5 . 1 0 0
3 2 2 3 2 1 1 0 1
Then this is the output I am trying to achieve
id member1 member2 member3 member4 sa1 sa2 sa3 sa4
1 0 2 . 0 0 1 . 0
1_1 0 . 2 0 0 . 1 0
2 2 . 0 5 . . 0 0
2_1 . 2 0 5 . 1 0 0
3 2 . 3 . 1 . 0 .
3_1 . 2 3 . . 1 0 .
3_2 . . 3 2 . . 0 1
Will appreciate any help. Thank you!
You need to count the number of '2's. You also need to remember where they used to be. "I had the spots removed for good luck, but I remember where the spots formerly were."
data have ;
input id :$10. member1 member2 member3 member4 sa1 sa2 sa3 sa4 ;
cards;
1 0 2 2 0 0 1 1 0
2 2 2 0 5 . 1 0 0
3 2 2 3 2 1 1 0 1
4 2 0 0 0 . . . .
5 0 0 0 0 . . . .
;
data want ;
  set have ;
  array m member1-member4 ;
  array x [4] _temporary_;
  * flag which positions hold a 2: the expression m[index]=2 evaluates to 1 or 0 ;
  do index=1 to dim(m);
    x[index] = m[index]=2;
  end;
  n2 = sum(of x[*]);
  if n2<2 then output;
  else do counter=1 to n2;
    * rebuild the id, adding a suffix for each duplicate row after the first ;
    id=scan(id,1,'_');
    if counter > 1 then id=catx('_',id,counter-1);
    counter2=0;
    * keep the 2 only in the position this copy of the row is for ;
    do index=1 to dim(m);
      if x[index] then do;
        counter2+1;
        if counter = counter2 then m[index]=2;
        else m[index]=.;
      end;
    end;
    output;
  end;
  drop index n2 counter counter2;
run;
Results
Obs id member1 member2 member3 member4 sa1 sa2 sa3 sa4
1 1 0 2 . 0 0 1 1 0
2 1_1 0 . 2 0 0 1 1 0
3 2 2 . 0 5 . 1 0 0
4 2_1 . 2 0 5 . 1 0 0
5 3 2 . 3 . 1 1 0 1
6 3_1 . 2 3 . 1 1 0 1
7 3_2 . . 3 2 1 1 0 1
8 4 2 0 0 0 . . . .
9 5 0 0 0 0 . . . .
I think you're expecting us to code the whole thing for you... I don't quite follow your explanation of the logic you want, but to start off with:
1. Create a new dataset.
2. Rename all the variables on the way in - prefix with O_ (original).
3. Count however you like how many values contain 2 (HOWMANYTWOS).
4. Do ROW = 1 to HOWMANYTWOS:
4.1 Go through the values of the O_ variables again.
4.2 If the 2 corresponds to your increasing counter ROW, it is the 2 you wish to keep, so don't touch it; if it does not correspond to ROW, make it missing (.).
4.3 Output the record with a new ID (if required).
a start for you:
data NEW;
  set ORIG (rename=(MEMBER1-MEMBER4=O_MEMBER1-O_MEMBER4 ID=O_ID)); /* etc. - rename the rest likewise */
  HOWMANYTWOS = sum(O_MEMBER1=2, O_MEMBER2=2, O_MEMBER3=2, O_MEMBER4=2);
  /* Step through and create the new rows - you still need to step through the
     variables to decide which 2 to keep before outputting. NOTE: do not change
     the O_ variables; only create/update the variables going to the output
     dataset (the O_ versions are for checking against only). */
  do ROW = 1 to HOWMANYTWOS;
    ID = ifc(ROW = 1, O_ID, catx("_", O_ID, ROW));
    /* create a counter here */
    output;
  end;
run;
Sorry - I haven't got SAS here and it's been a little while.
I have a variable with IDs:
clear
input ID
1
.
2
1
.
3
4
5
4
4
6
end
How can I create separate categorical variables, one per ID value and named after it, with values 1 and 2 (the latter where the observation's ID matches that value)?
For example, variable _ID_1 should look as follows:
2
.
1
2
.
1
1
1
1
1
1
Any ideas?
Another way to do it:
clear
input ID
1
.
2
1
.
3
4
5
4
4
6
end
forvalues j = 1/6 {
    generate ID_`j' = 1 + (ID == `j') if ID != .
}
list
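The expression ID == `j' evaluates to 1 when the equality holds and 0 otherwise, so 1 + (ID == `j') yields 2 for a match and 1 for any other non-missing ID, while the if qualifier keeps missing IDs missing.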
Imagine the following Stata data structure:
input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end
I want to fill in the missing values using the corresponding pair values from other observations. However, if there is ambiguity (in the example, 3 corresponds to both 1 and 2), the code should not copy. In my example, the final data structure should look like this:
1 3
1 3
1 3
2 3
2 3
2 3
. 3
Note that both the 1 and 2 groups are filled, as each is unambiguously paired with 3.
My data is only numeric, and the number of unique values of variables x and y is large, so I am looking for a general rule that works in every case.
I am thinking of using the user-written command carryforward, running something like
bysort x: carryforward y if x != . , replace dynamic_condition(x[_n-1] == x[_n]) strict
bysort y: carryforward x if y != . , replace dynamic_condition(y[_n-1] == y[_n]) strict
Yet, this does not work when there are double matches.
UPDATE: the solution proposed by Nick does not work for every example. I updated the example to reflect this. The reason the proposed solution does not work is that the egen function tag puts a 1 at only one instance of each value. Thus, when a value (3) is related to two values (1, 2), the tag appears in only one of them, and so the copying occurs for only one. In the example above, Nick's code and results are:
egen tagy = tag(y) if !missing(y)
egen tagx = tag(x) if !missing(x)
egen ny = total(tagy), by(x)
egen nx = total(tagx), by(y)
bysort x (y) : replace y = y[1] if ny == 1
bysort y (x) : replace x = x[1] if nx == 1
list, sep(0)
+-------------------------------+
| x y tagy tagx ny nx |
|-------------------------------|
1. | 1 3 0 0 1 0 |
2. | 1 3 0 0 1 0 |
3. | 1 3 1 1 1 2 |
4. | 2 3 0 1 0 2 |
5. | . 3 0 0 0 2 |
6. | 2 . 0 0 0 0 |
7. | 2 . 0 0 0 0 |
+-------------------------------+
As seen, the code works for filling x=1 and not filling y=3 (line 5). Yet, it does not fill lines 6 and 7 because tagy=1 only appears once (x=1).
This is a bit clunky, but it should work:
bysort x: egen temp = sd(y) if x != .
bysort x (y): replace y = y[1] if temp == 0
drop temp
Since the standard deviation of a constant is zero, temp == 0 exactly when the non-missing values of y are all the same within a given x.
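A minimal alternative sketch in the same spirit (not from the original answers): compare the group minimum and maximum of y instead of its standard deviation, which also covers groups containing a single non-missing y, where sd() is undefined.
bysort x: egen ymin = min(y)
bysort x: egen ymax = max(y)
replace y = ymin if missing(y) & !missing(x) & ymin == ymax
drop ymin ymax
The same pattern with the roles of x and y swapped fills x within y groups.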
sort x y
replace y = y[_n-1] if missing(y) & x[_n-1] == x[_n]
Consider the following example:
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month,day,year)
format date %td
drop day month year
For each group, I have a varying number of observations. Each observation refers to an event that is specified with a date. The variable number is the observation number within each group.
Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.
EDIT Begin: The following information was missing but is necessary to tackle this problem: the treatment variable takes the value 1 only when there is no other observation within the same group during the preceding year. Thus the variable NUM never has to count observations with treatment = 1. In principle, it is possible that two observations within a group have identical dates. EDIT End
I have looked into Stata tip 51: Events in intervals. It seems to work, but my dataset is huge (more than 1 million observations), so that approach is really inefficient - especially because I do not care about the treatment = 0 observations.
I was wondering if there is any alternative. My idea was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in a variable latestDate). Then the count could be obtained as the difference between the value of number for that observation and the value of number for the treatment observation itself.
Note: My "inefficient" code looks as follows
gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)' {
    count if inrange(date-date[`i'],1,365) & group == group[`i']
    replace count = r(N) in `i'
}
sort group date
I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.
sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = sum(in_window), by(group yrafter)
replace answer = . if treatment == 0
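A brief note on why this counts the right window: within each group, yrafter is set at a treated observation and carried forward until the next treatment, so each treated observation and the untreated observations that follow it share one yrafter value. in_window then flags only those followers whose date falls inside the year (the treated observation itself is excluded because of treatment == 0), and the egen sum by group and yrafter attaches that count to every observation of the block; the last line blanks it out for the untreated ones.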
I can't promise this will be faster than a loop but I suspect that it will be.
The question is not completely clear.
Consider the following data with two different results, num2 and num3:
+-----------------------------------------+
| date2 group treat num2 num3 |
|-----------------------------------------|
| 01feb2000 1 1 3 2 |
| 01jun2000 1 0 . . |
| 01sep2000 1 0 . . |
| 01nov2000 1 1 0 0 |
| 01may2002 1 0 . . |
| 01jan2010 1 1 1 1 |
| 01may2010 1 0 . . |
|-----------------------------------------|
| 01jan2001 2 1 0 0 |
| 01mar2002 2 1 0 0 |
+-----------------------------------------+
The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.
The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.
num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient (because the data are sorted by group and date, any observation within the year after observation `j' must come at or after it, so only the range `j'/L needs to be scanned), and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:
clear
set more off
input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end
list
gen date2 = date(date,"DMY")
format date2 %td
drop date count num
order date2
list, sepby(group)
*----- what you want -----
gen num2 = .
isid group date, sort
forvalues j = 1/`=_N' {
    count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j']
    replace num2 = r(N) in `j'
}
replace num2 = . if !treat
list, sepby(group)
num3 is computed with code similar in spirit (and results) to that posted by @jfeigenbaum:
<snip>
*----- what you want -----
isid group date, sort
by group: gen indicat = sum(treat)
sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat
list, sepby(group)
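Here by group: gen indicat = sum(treat) forms a running sum of treat within group, so each treated observation and the untreated observations that follow it (up to the next treatment) share one value of indicat; because the stable sort preserves the date order, date2[1] inside the egen call refers to the treated observation that opens each block.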
Even more than two interpretations are possible for your problem, but I'll leave it at that.
(Note that I have changed your example data to include cases that probably make the problem more realistic.)
I have a dataset that consists of a series of readings made by different people/instruments, of a bunch of different dimensions. It looks like this:
SUBJECT DIM1_1 DIM1_2 DIM1_3 DIM1_4 DIM1_5 DIM2_1 DIM2_2 DIM2_3 DIM3_1 DIM3_2
1 1 . 1 1 2 3 3 3 2 .
2 1 1 . 1 1 2 2 3 1 1
3 2 2 2 . . 1 . . 5 5
... ... ... ... ... ... ... ... ... ... ...
My real dataset contains around 190 dimensions, with up to 5 measures in each one
I have to obey a set of rules to create a new variable for each dimension:
If there are 2 different values in the same dimension (missings excluded), the new variable is a missing.
If all values are the same (missings excluded), the new variable assumes the same value.
My new variables should look like this:
SUBJECT ... DIM1_X DIM2_X DIM3_X
1 ... . 3 2
2 ... 1 . 1
3 ... 2 1 5
The problem here is that I don't have the same number of measures for each dimension. Also, I could only come up with a lot of IFs (and I mean a lot, as more measures in a given dimension means more comparisons), so I wonder if there is an easier way to handle this particular problem.
Any help would be appreciated.
Thanks in advance.
The easiest way is to transpose the data to a vertical layout (one row per DIMx_y), summarize, set the values you want missing to missing, then retranspose (and, if needed, merge back on).
data have;
input SUBJECT DIM1_1 DIM1_2 DIM1_3 DIM1_4 DIM1_5 DIM2_1 DIM2_2 DIM2_3 DIM3_1 DIM3_2;
datalines;
1 1 . 1 1 2 3 3 3 2 .
2 1 1 . 1 1 2 2 3 1 1
3 2 2 2 . . 1 . . 5 5
;;;;
run;
data have_pret;
  set have;
  array dim_data DIM:;
  do _t = 1 to dim(dim_data); *the dim function is not related to the variable names - it gives the number of variables in the array;
    dim_group = scan(vname(dim_data[_t]),1,'_');
    dim_num = input(scan(vname(dim_data[_t]),2,'_'),BEST12.);
    dim_val = dim_data[_t];
    output;
  end;
  keep dim_group dim_num subject dim_val;
run;
proc freq data=have_pret noprint;
  by subject dim_group;
  tables dim_val / out=want_pret(where=(not missing(dim_val)));
run;
data want_pret2;
  set want_pret;
  by subject dim_group;
  if percent ne 100 then dim_val=.;
  idval = cats(dim_group,'_X');
  if last.dim_group;
run;
proc transpose data=want_pret2 out=want;
  by subject;
  id idval;
  var dim_val;
run;