calculating durations across variables - stata

My data in Stata is like:
day1 day1_dt day2 day2_dt day3 day3_dt day4 day4_dt
0 2010-01-02 4 2010-01-03 . 2010-01-04 2 2010-01-05
. 2011-05-02 3 2011-05-03 4 2011-05-04 4 2011-05-05
5 2012-01-05 4 2012-01-06 4 2012-01-07 4 2012-01-08
4 2015-05-02 4 2015-05-03 4 2015-05-04 4 2015-05-05
1 2009-05-02 4 2009-05-03 0 2009-05-04 4 2009-05-05
I want to calculate the following:
1. The duration in days for which the dayX variables have the value 4.
I think I solved the first part using the following program:
generate int flg1 = 1 if day1 == 4
generate int flg2 = 1 if day2 == 4
generate int flg3 = 1 if day3 == 4
generate int flg4 = 1 if day4 == 4
egen duration = rowtotal(flg*)
2. The date on which the value 4 ends or changes, recorded in end_date.
The final data would look like:
day1 day1_dt day2 day2_dt day3 day3_dt day4 day4_dt duration end_date
0 2010-01-02 4 2010-01-03 . 2010-01-04 2 2010-01-05 1 2010-01-04
. 2011-05-02 3 2011-05-03 4 2011-05-04 4 2011-05-05 2 .
5 2012-01-05 4 2012-01-06 4 2012-01-07 4 2012-01-08 3 .
4 2015-05-02 4 2015-05-03 4 2015-05-04 4 2015-05-05 4 .
1 2009-05-02 4 2009-05-03 0 2009-05-04 4 2009-05-05 2 .

There seems to be a typo in the second row in your final example. If not, then please explain why you want duration to be 1 and not 2 there.
If it was a typo, then this is the simplest way to do it. Note that it is only the last line of code that is the answer to your question.
// This is the best-practice way of sharing data examples in Stata on Stack Overflow
* Example generated by -dataex-. For more info, type help dataex
clear
input byte day1 int day1_dt byte day2 int day2_dt byte day3 int day3_dt byte day4 int day4_dt
0 18264 4 18265 . 18266 2 18267
. 18749 3 18750 4 18751 4 18752
5 18997 4 18998 4 18999 4 19000
4 20210 4 20211 4 20212 4 20213
1 18019 4 18020 0 18021 4 18022
end
format %tdnn/dd/CCYY day1_dt
format %tdnn/dd/CCYY day2_dt
format %tdnn/dd/CCYY day3_dt
format %tdnn/dd/CCYY day4_dt
// This is your solution
* Count number of day1, day2 etc vars with value 4
egen duration = anycount(day?), values(4)

From a Stata point of view you seem to be holding panel or longitudinal data in a wide layout. As you are finding out, that makes even simple tasks quite complicated. I recommend changing to long layout with a reshape.
See the Stata tag wiki for how to give data examples as Stata code (short explanation: use the dataex command). Your example is fairly clear but requires a guess at what kind of date you have, as the display format could be YMD or YDM. I guessed one way, but the principles are the same the other way. If your date variables are really strings, you need to push them through daily() to do anything useful.
Script and output follow. You'll want to assign a display format to enddate as well.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)
0 4 . 2 18264 18265 18266 18267
. 3 4 4 18749 18750 18751 18752
5 4 4 4 18997 18998 18999 19000
4 4 4 4 20210 20211 20212 20213
1 4 0 4 18019 18020 18021 18022
end
format %tdCY-N-D day1_dt
format %tdCY-N-D day2_dt
format %tdCY-N-D day3_dt
format %tdCY-N-D day4_dt
gen long id = _n
reshape long day day@_dt, i(id)
egen duration = total(day == 4), by(id)
egen enddate = max(cond(day == 4, day_dt, .)), by(id)
egen whenlast = max(day_dt), by(id)
replace enddate = . if enddate == whenlast
list, sepby(id)
. * Example generated by -dataex-. For more info, type help dataex
. clear
. input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)
day1 day2 day3 day4 day1_dt day2_dt day3_dt day4_dt
1. 0 4 . 2 18264 18265 18266 18267
2. . 3 4 4 18749 18750 18751 18752
3. 5 4 4 4 18997 18998 18999 19000
4. 4 4 4 4 20210 20211 20212 20213
5. 1 4 0 4 18019 18020 18021 18022
6. end
. format %tdCY-N-D day1_dt
. format %tdCY-N-D day2_dt
. format %tdCY-N-D day3_dt
. format %tdCY-N-D day4_dt
.
. gen long id = _n
. reshape long day day@_dt, i(id)
(j = 1 2 3 4)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 5 -> 20
Number of variables 9 -> 4
j variable (4 values) -> _j
xij variables:
day1 day2 ... day4 -> day
day1_dt day2_dt ... day4_dt -> day_dt
-----------------------------------------------------------------------------
.
. egen duration = total(day == 4), by(id)
.
. egen enddate = max(cond(day == 4, day_dt, .)), by(id)
. egen whenlast = max(day_dt), by(id)
. replace enddate = . if enddate == whenlast
(16 real changes made, 16 to missing)
.
. list, sepby(id)
+------------------------------------------------------------+
| id _j day day_dt duration enddate whenlast |
|------------------------------------------------------------|
1. | 1 1 0 2010-01-02 1 18265 18267 |
2. | 1 2 4 2010-01-03 1 18265 18267 |
3. | 1 3 . 2010-01-04 1 18265 18267 |
4. | 1 4 2 2010-01-05 1 18265 18267 |
|------------------------------------------------------------|
5. | 2 1 . 2011-05-02 2 . 18752 |
6. | 2 2 3 2011-05-03 2 . 18752 |
7. | 2 3 4 2011-05-04 2 . 18752 |
8. | 2 4 4 2011-05-05 2 . 18752 |
|------------------------------------------------------------|
9. | 3 1 5 2012-01-05 3 . 19000 |
10. | 3 2 4 2012-01-06 3 . 19000 |
11. | 3 3 4 2012-01-07 3 . 19000 |
12. | 3 4 4 2012-01-08 3 . 19000 |
|------------------------------------------------------------|
13. | 4 1 4 2015-05-02 4 . 20213 |
14. | 4 2 4 2015-05-03 4 . 20213 |
15. | 4 3 4 2015-05-04 4 . 20213 |
16. | 4 4 4 2015-05-05 4 . 20213 |
|------------------------------------------------------------|
17. | 5 1 1 2009-05-02 2 . 18022 |
18. | 5 2 4 2009-05-03 2 . 18022 |
19. | 5 3 0 2009-05-04 2 . 18022 |
20. | 5 4 4 2009-05-05 2 . 18022 |
+------------------------------------------------------------+
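If the date variables arrive as strings rather than as numeric daily dates, the conversion through daily() mentioned above might look like the following sketch. The variable name day1_str and the two example rows are assumptions for illustration, and the mask "YMD" assumes year-month-day order, which is the guess made in this answer.

```stata
clear
input byte day1 str10 day1_str
0 "2010-01-02"
. "2011-05-02"
end

* daily() turns a string into a numeric daily date;
* the mask "YMD" states the order of year, month and day
generate long day1_dt = daily(day1_str, "YMD")

* a display format makes the numeric date readable
format day1_dt %tdCY-N-D
list
```

The same kind of format command can be applied to enddate and whenlast so that they display as dates rather than as integers.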

This is a sequel to the answer by @TheIceBear, showing how to answer Question 2 while keeping the same layout.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)
0 4 . 2 18264 18265 18266 18267
. 3 4 4 18749 18750 18751 18752
5 4 4 4 18997 18998 18999 19000
4 4 4 4 20210 20211 20212 20213
1 4 0 4 18019 18020 18021 18022
end
format %tdCY-N-D day1_dt
format %tdCY-N-D day2_dt
format %tdCY-N-D day3_dt
format %tdCY-N-D day4_dt
gen enddate = .
* 1/4 is contingent on day*_dt running over 1 to 4
* and on those variables being in date order
forval j = 1/4 {
replace enddate = day`j'_dt if day`j' == 4
}
egen whenlast = rowmax(day*_dt)
replace enddate = . if enddate == whenlast
format enddate whenlast %td
list enddate whenlast
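The code above records the last date on which the value is 4. The expected output in the question seems instead to want the date of the first day after a run of 4s ends, left missing when the final day is still 4. That reading is an assumption about the intent, not something the question states outright. A minimal sketch under that assumption, in the same wide layout (the name enddate2 is hypothetical):

```stata
clear
input byte(day1 day2 day3 day4) float(day1_dt day2_dt day3_dt day4_dt)
0 4 . 2 18264 18265 18266 18267
. 3 4 4 18749 18750 18751 18752
5 4 4 4 18997 18998 18999 19000
4 4 4 4 20210 20211 20212 20213
1 4 0 4 18019 18020 18021 18022
end

gen enddate2 = .
* compare each day with the next one; a later change overwrites an earlier one
forval j = 1/3 {
    local k = `j' + 1
    replace enddate2 = day`k'_dt if day`j' == 4 & day`k' != 4
}
* assumption: a spell still running on the last day has no end date
replace enddate2 = . if day4 == 4
format enddate2 %tdCY-N-D
list day? enddate2
```

As with the loop above, this depends on the variables being numbered 1 to 4 and being in date order.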

Related

How to recode separate variables from a multiple response survey question into one variable

I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and if coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command the total number of responses were not adding up, as shown below
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
To clarify: I am trying to create a new variable that captures the column sum of each question, not the row total across all questions. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b), so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
The fallacy here is ignoring the possibility of multiple 1s as answers to the various Q4???? variables. For example if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites and so ignores all positive answers before the last positive answer.
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}
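To make the overwriting problem and the three alternatives concrete, here is a small sketch on made-up data. The four observations are invented for illustration, and the variable names assume three underscores, as in the question.

```stata
clear
input byte(Q4___1 Q4___2 Q4___3 Q4___4 Q4___5 Q4___6)
1 1 1 1 1 1
0 1 0 1 0 1
1 0 0 0 0 0
0 0 0 0 0 0
end

* the approach in the question: each replace overwrites earlier ones
gen q4 = .
forval j = 1/6 {
    replace q4 = `j' if Q4___`j' == 1
}

* (1) number of positive answers per respondent
egen Q4_total = rowtotal(Q4????)

* (2) all answers side by side as a string
egen Q4_concat = concat(Q4????)

* (3) which questions were answered 1
* Q4_pos itself has six characters, so it is created
* only after the last use of the wildcard Q4????
gen Q4_pos = ""
forval j = 1/6 {
    replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}

list q4 Q4_total Q4_concat Q4_pos
```

The first two respondents both end up with q4 equal to 6, although one answered yes to every question and the other to three of them; the other variables keep that information.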
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36

Generating variable by groups taking values of certain observations

I have a dataset with only the variable value (new_var below shows the desired result):
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
How can I generate a new variable new_var containing a repeating sequence of the first three observations in value?
Many ways to do it: here are two:
clear
input value new_var
1 1
3 3
5 5
30 1
40 3
50 5
11 1
12 3
13 5
end
egen index = seq(), to(3)
generate wanted = value[index]
generate direct = cond(mod(_n, 3) == 1, 1, cond(mod(_n, 3) == 2, 3, 5))
list, sep(3)
+-------------------------------------------+
| value new_var index wanted direct |
|-------------------------------------------|
1. | 1 1 1 1 1 |
2. | 3 3 2 3 3 |
3. | 5 5 3 5 5 |
|-------------------------------------------|
4. | 30 1 1 1 1 |
5. | 40 3 2 3 3 |
6. | 50 5 3 5 5 |
|-------------------------------------------|
7. | 11 1 1 1 1 |
8. | 12 3 2 3 3 |
9. | 13 5 3 5 5 |
+-------------------------------------------+
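A third possibility, as a sketch, skips the index variable and computes the subscript directly with mod(). The name viamod is hypothetical; the block length 3 is taken from the question.

```stata
clear
input value
1
3
5
30
40
50
11
12
13
end

* mod(_n - 1, 3) + 1 cycles through 1, 2, 3,
* so value[] is always read from observations 1 to 3
generate viamod = value[mod(_n - 1, 3) + 1]
list, sep(3)
```

Unlike the direct variable above, this does not hard-code the values 1, 3 and 5, so it would still work if the first three observations changed.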

Adding observations between rows

I would like to create new observations as follows:
A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
*3* 1 .
*3* 2 .
*3* 3 .
*3* 4 .
*3* 5 .
4 1 4
4 2 3
4 3 1
The new lines are indicated by asterisks.
How can I create new observations for variable A and B?
This is a simple expand:
clear
input A B C
1 1 1
1 2 2
1 3 4
1 4 5
1 5 2
2 1 1
2 2 5
2 3 3
2 4 3
4 1 4
4 2 3
4 3 1
end
generate id = _n
expand 6 if id == 10
replace id = 11 if _n == _N
replace A = 3 if id == 10
replace C = . if id == 10
bysort id: replace B = cond(_n == 1, 1, B[_n-1]+1) if id == 10
Which will produce the desired output:
list, sepby(A)
+----------------+
| A B C id |
|----------------|
1. | 1 1 1 1 |
2. | 1 2 2 2 |
3. | 1 3 4 3 |
4. | 1 4 5 4 |
5. | 1 5 2 5 |
|----------------|
6. | 2 1 1 6 |
7. | 2 2 5 7 |
8. | 2 3 3 8 |
9. | 2 4 3 9 |
|----------------|
10. | 3 1 . 10 |
11. | 3 2 . 10 |
12. | 3 3 . 10 |
13. | 3 4 . 10 |
14. | 3 5 . 10 |
|----------------|
15. | 4 1 4 11 |
16. | 4 2 3 11 |
17. | 4 3 1 12 |
+----------------+
The code could be shorter.
expand 2 if _n < 6
replace A = 3 if _n > _N - 5
*replace B = _n + 5 - _N if A == 3
replace C = . if A == 3
sort A B
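Both versions rely on expand adding the new copies at the end of the dataset, which is why conditions such as _n > _N - 5 pick them out. A small sketch on cut-down, made-up data shows this using the generate() option of expand, which marks the added observations; the variable name copy is hypothetical.

```stata
clear
input A B C
1 1 1
1 2 2
2 1 5
end

* duplicate the observations with A == 1;
* copy is 0 for original observations and 1 for added ones
expand 2 if A == 1, generate(copy)
list
```

With a marker variable like this, the later replace commands could condition on copy == 1 rather than on observation numbers.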

Populating new variable using vlookup with multiple criteria in another variable

1) A new variable should be created for each unique observation listed in variable sku, which contains repeated values.
2) These newly created variables should be assigned the value of own product's price at the store/week level, as long as observations' sku value is in the same subcategory (subc) as the variable itself. For example, in eta2,3, observations in line 3, 4, and 5 have the same value because they all belong to the same subcategory as sku #3. [eta2,3 indicates sku 3, subc 2.]
3) x indicates that this is the original value for the product/subcategory that is currently being replicated.
4) If an observation doesn't belong to the same subcategory, it should reflect "0".
Orange is the given data. In green are the values from the steps 1, 2, and 3. White cells are step 4.
I am unable to offer a solution of my own, as searching for a way to generate a variable using existing observations hasn't given me results.
I also understand that it must be a combination of forvalues, foreach, and levelsof commands?
clear
input units price sku week store subc
3 4.3 1 1 1 1
2 3 2 1 1 1
1 2.5 3 1 1 2
4 12 5 1 1 2
5 12 6 1 1 3
35 4.3 1 1 2 1
23 3 2 1 2 1
12 2.5 3 1 2 2
35 12 5 1 2 2
35 12 6 1 2 3
3 20 1 2 1 1
2 30 2 2 1 1
4 40 3 2 2 2
1 50 4 2 2 2
9 10 5 2 2 2
2 90 6 2 2 3
end
UPDATE
Based on Nick Cox' feedback, this is the final code that gives the result I have been looking for:
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
egen joint = group(subc sku), label
bysort store week : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
tostring subc sku, replace
gen new = subc + "_"+sku
su joint, meanonly
forval j = 1/`r(max)'{
local J = new[`j']
gen eta`J' = .
}
sort subc week store sku
egen joint1 = group(subc week store), label
gen long id = _n
su joint1, meanonly
quietly forval i = 1/`r(max)' {
su id if joint1 == `i', meanonly
local jmin = r(min)
local jmax = r(max)
forval j = `jmin'/`jmax' {
local subc = subc[`j']
local sku = sku[`j']
replace eta`subc'_`sku' = price[`j'] in `jmin'/`jmax'
replace eta`subc'_`sku' = 0 in `j'/`j'
}
}
I worry on your behalf that in a dataset of any size what you ask for would mean many, many extra variables. I wonder on your behalf whether you need all of them anyway for whatever you want to do with them.
That aside, this seems to be what you want. Naturally your column headers in your spreadsheet view aren't legal variable names. Disclosure: despite being the original author of levelsof I wouldn't prefer its use here.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
end
sort subc sku
* subc identifiers guaranteed to be integers 1 up
egen subc_id = group(subc), label
* observation numbers in a variable
gen long id = _n
* how many subc? loop over the range
su subc_id, meanonly
forval i = 1/`r(max)' {
* which subc is this one? look it up using -summarize-
* assuming that subc is numeric!
su subc if subc_id == `i', meanonly
local I = r(min)
* which observation numbers for this subc?
* given the prior sort, they are all contiguous
su id if subc_id == `i', meanonly
* for each observation in the subc, find out the sku and copy its price
* to all observations in that subc
forval j = `r(min)'/`r(max)' {
local J = sku[`j']
gen eta_`I'_`J' = cond(subc_id == `i', price[`j'], 0)
}
}
list subc eta*, sepby(subc)
+------------------------------------------------------------------+
| subc eta_1_1 eta_1_2 eta_2_3 eta_2_4 eta_2_5 eta_3_6 |
|------------------------------------------------------------------|
1. | 1 4.3 3 0 0 0 0 |
2. | 1 4.3 3 0 0 0 0 |
|------------------------------------------------------------------|
3. | 2 0 0 2.5 1 12 0 |
4. | 2 0 0 2.5 1 12 0 |
5. | 2 0 0 2.5 1 12 0 |
|------------------------------------------------------------------|
6. | 3 0 0 0 0 0 12 |
+------------------------------------------------------------------+
Notes:
N1. In your example, subc is numbered 1, 2, etc. My extra variable subc_id ensures that to be true even if in your real data the identifiers are not so clean.
N2. The expression
cond(subc_id == `i', price[`j'], 0)
could also be
(subc_id == `i') * price[`j']
N3. It seems possible that a different data structure would be much more efficient.
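One caveat on N2, shown as a sketch with made-up data: the two expressions agree only when the looked-up price is not missing. If it is missing, cond() still returns 0 outside the group, whereas the product is missing everywhere, because 0 times missing is missing in Stata. The variable names c1, p1, c2 and p2 are hypothetical.

```stata
clear
input subc_id price
1 4.3
1 .
2 2.5
end

* price[1] is not missing: both forms agree
gen c1 = cond(subc_id == 1, price[1], 0)
gen p1 = (subc_id == 1) * price[1]
assert c1 == p1

* price[2] is missing: cond() still gives 0 outside the group,
* but the product is missing in every observation
gen c2 = cond(subc_id == 1, price[2], 0)
gen p2 = (subc_id == 1) * price[2]
list
```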
EDIT: Here is code and results for another data structure.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
end
sort subc sku
egen subc_id = group(subc), label
bysort subc : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
forval j = 1/`jmax' {
gen eta`j' = .
gen which`j' = .
}
gen long id = _n
su subc_id, meanonly
quietly forval i = 1/`r(max)' {
su id if subc_id == `i', meanonly
local jmin = r(min)
local jmax = r(max)
local k = 1
forval j = `jmin'/`jmax' {
replace which`k' = sku[`j'] in `jmin'/`jmax'
replace eta`k' = price[`j'] in `jmin'/`jmax'
local ++k
}
}
list subc sku *1 *2 *3 , sepby(subc)
+------------------------------------------------------------+
| subc sku eta1 which1 eta2 which2 eta3 which3 |
|------------------------------------------------------------|
1. | 1 1 4.3 1 3 2 . . |
2. | 1 2 4.3 1 3 2 . . |
|------------------------------------------------------------|
3. | 2 3 2.5 3 1 4 12 5 |
4. | 2 4 2.5 3 1 4 12 5 |
5. | 2 5 2.5 3 1 4 12 5 |
|------------------------------------------------------------|
6. | 3 6 12 6 . . . . |
+------------------------------------------------------------+
I am adding another answer that tackles combinations of subc and week. Previous discussion establishes that what you are trying to do would add an extra variable for every observation. This can't be a good idea! At best, you might just have many new variables, mostly zeros. At worst, you will run into Stata's limits.
Hence I won't support your endeavour to go further down the same road, but show how the second data structure I discuss in my previous answer can be produced. Indeed, you haven't indicated (a) why you want all these variables, which are just the existing data redistributed; (b) what your strategy is for dealing with them; (c) why rangestat (SSC) or some other program could not remove the need to create them in the first place.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
sort subc week sku
egen joint = group(subc week), label
bysort joint : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
forval j = 1/`jmax' {
gen eta`j' = .
gen which`j' = .
}
gen long id = _n
su joint, meanonly
quietly forval i = 1/`r(max)' {
su id if joint == `i', meanonly
local jmin = r(min)
local jmax = r(max)
local k = 1
forval j = `jmin'/`jmax' {
replace which`k' = sku[`j'] in `jmin'/`jmax'
replace eta`k' = price[`j'] in `jmin'/`jmax'
local ++k
}
}
list subc week sku *1 *2 *3 , sepby(subc week)
+-------------------------------------------------------------------+
| subc week sku eta1 which1 eta2 which2 eta3 which3 |
|-------------------------------------------------------------------|
1. | 1 1 1 4.3 1 3 2 . . |
2. | 1 1 2 4.3 1 3 2 . . |
|-------------------------------------------------------------------|
3. | 1 2 1 5.3 1 4 2 . . |
4. | 1 2 2 5.3 1 4 2 . . |
|-------------------------------------------------------------------|
5. | 2 1 3 2.5 3 1 4 12 5 |
6. | 2 1 4 2.5 3 1 4 12 5 |
7. | 2 1 5 2.5 3 1 4 12 5 |
|-------------------------------------------------------------------|
8. | 2 2 3 3.5 3 2 4 13 5 |
9. | 2 2 4 3.5 3 2 4 13 5 |
10. | 2 2 5 3.5 3 2 4 13 5 |
|-------------------------------------------------------------------|
11. | 3 1 6 12 6 . . . . |
|-------------------------------------------------------------------|
12. | 3 2 6 13 6 . . . . |
+-------------------------------------------------------------------+
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
egen joint = group(subc sku), label
bysort store week : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
tostring subc sku, replace
gen new = subc + "_"+sku
su joint, meanonly
forval j = 1/`r(max)'{
local J = new[`j']
gen eta`J' = .
}
sort subc week store sku
egen joint1 = group(subc week store), label
gen long id = _n
su joint1, meanonly
quietly forval i = 1/`r(max)' {
su id if joint1 == `i', meanonly
local jmin = r(min)
local jmax = r(max)
forval j = `jmin'/`jmax' {
local subc = subc[`j']
local sku = sku[`j']
replace eta`subc'_`sku' = price[`j'] in `jmin'/`jmax'
replace eta`subc'_`sku' = 0 in `j'/`j'
}
}
list subc sku store week eta*, sepby(subc)
+---------------------------------------------------------------------------------+
| store week subc sku eta1_1 eta1_2 eta2_3 eta2_4 eta2_5 eta3_6 |
|---------------------------------------------------------------------------------|
1. | 1 1 1 2 4.3 0 . . . . |
2. | 1 1 1 1 0 3 . . . . |
|---------------------------------------------------------------------------------|
3. | 1 1 2 4 . . 2.5 0 12 . |
4. | 1 1 2 3 . . 0 1 12 . |
5. | 1 1 2 5 . . 2.5 1 0 . |
|---------------------------------------------------------------------------------|
6. | 1 1 3 6 . . . . . 0 |
|---------------------------------------------------------------------------------|
7. | 1 2 1 2 5.3 0 . . . . |
8. | 1 2 1 1 0 4 . . . . |
|---------------------------------------------------------------------------------|
9. | 1 2 2 3 . . 0 2 13 . |
10. | 1 2 2 5 . . 3.5 2 0 . |
11. | 1 2 2 4 . . 3.5 0 13 . |
|---------------------------------------------------------------------------------|
12. | 1 2 3 6 . . . . . 0 |
+---------------------------------------------------------------------------------+

Looking up data within a file versus merging

I have a file that records the ratings that teacher X gives to teacher Y and the date each rating occurs.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 Date
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
5 11 13 1 "1/7/2010"
end
I want to look in the history to see how many times the RatingTeacher had been rated at the time they make the rating and the cumulative score. The result would look like this.
rating_id RatingTeacher RatedTeacher Rating Date TimesRated CumulativeRating
1 15 12 1 "1/1/2010" 0 0
2 12 11 2 "1/2/2010" 1 1
3 14 11 3 "1/2/2010" 0 0
4 14 13 2 "1/5/2010" 0 0
5 19 11 4 "1/6/2010" 0 0
5 11 13 1 "1/7/2010" 3 9
end
I have been merging the dataset with itself to get this to work, and it is fine. I was wondering if there is a more efficient way to do this within the file.
In your input data, I guess that the last rating_id should be 6 and that dates are MDY. Statalist members are asked to use dataex (SSC) to set up data examples. This isn't Statalist but there is no reason for lower standards to apply. See the Statalist FAQ
I rarely see even programmers be precise about what they mean by "efficient", whether it means fewer lines of code, less use of memory, more speed, something else or is just some all-purpose term of praise. This code loops over observations, which can certainly be slow for large datasets. More in this paper
We can't compare with your merge solution because you don't give the code.
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end
gen Date = daily(SDate, "MDY")
sort Date
gen Wanted = .
quietly forval i = 1/`=_N' {
count if Date < Date[`i'] & RatedT == RatingT[`i']
replace Wanted = r(N) in `i'
}
list, sep(0)
+---------------------------------------------------------------------+
| rating~d Rating~r RatedT~r Rating SDate Date Wanted |
|---------------------------------------------------------------------|
1. | 1 15 12 1 1/1/2010 18263 0 |
2. | 2 12 11 2 1/2/2010 18264 1 |
3. | 3 14 11 3 1/2/2010 18264 0 |
4. | 4 14 13 2 1/5/2010 18267 0 |
5. | 5 19 11 4 1/6/2010 18268 0 |
6. | 6 11 13 1 1/7/2010 18269 3 |
+---------------------------------------------------------------------+
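The loop above gives the count only. The cumulative rating asked for in the question could be obtained in the same style with summarize, as in this sketch. The name CumRating is hypothetical, and sorting on rating_id within Date is an added assumption to keep the order of ties reproducible.

```stata
clear
input rating_id RatingTeacher RatedTeacher Rating str8 SDate
1 15 12 1 "1/1/2010"
2 12 11 2 "1/2/2010"
3 14 11 3 "1/2/2010"
4 14 13 2 "1/5/2010"
5 19 11 4 "1/6/2010"
6 11 13 1 "1/7/2010"
end

gen Date = daily(SDate, "MDY")
sort Date rating_id

gen CumRating = .
quietly forval i = 1/`=_N' {
    * ratings received earlier by the teacher who is rating in observation i
    summarize Rating if Date < Date[`i'] & RatedTeacher == RatingTeacher[`i'], meanonly
    * with no earlier ratings the running total is set to 0
    replace CumRating = cond(r(N) > 0, r(sum), 0) in `i'
}

list, sep(0)
```

This still loops over observations, so the same warning about speed in large datasets applies.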
The building block is that the rater and ratee are a pair. You can use egen's group() to give a unique ID to each rater ratee pair.
egen pair = group(rater ratee)
bysort pair (date): gen timesRated = _n