How to overwrite a duplicate observation - stata

I conducted a phone survey and here is the prototype of my dataset:
var1 var2
6666 1
6666 2
7676 2
7676 1
8876 1
8876 2
89898 1
89898 2
9999 1
9999 2
5656 1
5656 2
2323 1
2323 2
9876 1
7654 1
var1 is the unique identifier for each case in my survey (in this case, phone numbers).
var2 is the outcome of the survey: 1 (successful), 2 (not successful).
I want to keep the observations for each var1 whose var2 == 1, while retaining the observations for each var1 whose var2 == 2 if there is no other observation where var2 == 1.
I have tried
duplicates drop var1 if var2 == 2, force
but I am not getting the desired output

The question is wrongly titled: you don't want to overwrite anything.
Your syntax doesn't do what you wish because it is not asking for what you want. You are asking whether there are duplicates of var1 if var2 == 2, and that command pays no attention whatsoever to observations for which var2 == 1.
Your example includes no observations for which var2 == 2 but there is no corresponding observation with var2 == 1. I have added one such.
Here's one way of meeting your goal. I show in passing that the duplicates command you have does nothing for this example; nor would it be expected to do anything.
. clear
. input var1 var2
var1 var2
1. 6666 1
2. 6666 2
3. 7676 2
4. 7676 1
5. 8876 1
6. 8876 2
7. 89898 1
8. 89898 2
9. 9999 1
10. 9999 2
11. 5656 1
12. 5656 2
13. 2323 1
14. 2323 2
15. 9876 1
16. 7654 1
17. 42 2
18. end
. duplicates list var1 if var2 == 2
Duplicates in terms of var1
(0 observations are duplicates)
. bysort var1 (var2) : assert _N == 1 | _N == 2
. by var1 : drop if _n == 2 & var2[2] == 2
(7 observations deleted)
. list, sepby(var1)
+--------------+
| var1 var2 |
|--------------|
1. | 42 2 |
|--------------|
2. | 2323 1 |
|--------------|
3. | 5656 1 |
|--------------|
4. | 6666 1 |
|--------------|
5. | 7654 1 |
|--------------|
6. | 7676 1 |
|--------------|
7. | 8876 1 |
|--------------|
8. | 9876 1 |
|--------------|
9. | 9999 1 |
|--------------|
10. | 89898 1 |
+--------------+
Another way to do it would be
. bysort var1 (var2) : keep if _n == 1 | var2[2] != 2
In fact
. bysort var1 (var2): keep if _n == 1
keeps observations with var2 == 1 if there are any and otherwise will also keep singletons with var2 == 2.
The hidden assumptions seem to include at most two observations for each distinct var1. Note the use of assert for checking assumptions about the dataset.
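If that assumption did not hold, a rough sketch along the following lines (my addition, not part of the answer above) would work on the raw data: flag whether each var1 group contains any successful outcome, then keep accordingly.
bysort var1 : egen best = min(var2)    // 1 if the group contains any success, 2 otherwise
keep if var2 == 1 | best == 2          // keep successes; keep failures only when no success exists
drop best
This sketch assumes var2 takes only the values 1 and 2; when a group has several failures and no success, all of them are kept.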

Browse all the rows and columns that contain a zero

Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?
My dear colleague @NickCox forces me to reply to this (duplicate) question because he is claiming that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
1. | 1 4 9 5 0 |
2. | 1 8 6 3 7 |
3. | 0 6 5 6 8 |
4. | 4 5 1 8 3 |
5. | 2 1 0 2 1 |
6. | 4 6 7 1 9 |
+----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
+----------------------+
| var obsno value |
|----------------------|
| var5 1 0 |
| var1 3 0 |
| var3 5 0 |
+----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach I recommended in the linked question for identifying negative values. Using levelsof, one can do the same thing as findname with a built-in command.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although I do not see why one would want to browse the results when one can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
generate rr = runiform()
replace `var' = 0 if rr < 0.0001
drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
+----------------------------+
| id var obsno value |
|----------------------------|
| 1 var86 18 0 |
| 1 var19 167 0 |
| 1 var13 226 0 |
|----------------------------|
| 2 var88 351 0 |
| 2 var36 361 0 |
| 2 var35 401 0 |
|----------------------------|
| 3 var42 628 0 |
| 3 var90 643 0 |
+----------------------------+
Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(# == 0)
finds variables for which any value is 0. search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(# == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
+-------------+
| var2 var3 |
|-------------|
2. | 0 1 |
3. | 1 0 |
+-------------+
So, the problem is split into two: finding the variables with any zeros and finding the observations with any zeros, and then putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
count if `v' == 0
if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.
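For instance, a minimal sketch (assuming the loop above has left at least one variable name in the local macro wanted) that browses only the interesting rows and columns:
egen any0 = anymatch(`wanted'), values(0)   // 1 in observations where any selected variable is 0
browse ID `wanted' if any0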

Save duplicates by id

I have two variables in Stata, id and price:
id price
1 4321
1 7634
1 7974
1 7634
1 3244
2 5943
2 3294
2 5645
2 3564
2 4321
2 4567
2 4567
2 4567
2 4567
3 5652
3 9586
3 5844
3 8684
3 2456
4 7634
Usually I can use the duplicates command to get the duplicate observations of a variable.
However, how can I create a new variable that will save the duplicates
of price for each id?
There is no need, as far as I can see, for duplicates to work with by:. duplicates whatever price id is the general recipe for your example, examining duplicates jointly for the two variables. Consider
clear
input id price
1 4321
1 7634
1 7974
1 7634
1 3244
2 5943
2 3294
2 5645
2 3564
2 4321
2 4567
2 4567
2 4567
2 4567
3 5652
3 9586
3 5844
3 8684
3 2456
4 7634
end
. duplicates example id price
Duplicates in terms of id price
+------------------------------------+
| group: # e.g. obs id price |
|------------------------------------|
| 1 2 2 1 7634 |
| 2 4 11 2 4567 |
+------------------------------------+
. duplicates tag id price, gen(tag)
Duplicates in terms of id price
. list id price if tag , sepby(id)
+------------+
| id price |
|------------|
2. | 1 7634 |
4. | 1 7634 |
|------------|
11. | 2 4567 |
12. | 2 4567 |
13. | 2 4567 |
14. | 2 4567 |
+------------+
Beyond that, I am not clear exactly what output or data result you wish to see.
EDIT In response to comment, here are two more direct approaches. duplicates is based on the idea that duplicates are mostly unwanted; you seem to have the opposite point of view, in which case duplicates is oblique to your wants.
* approach 1
bysort price id : gen wanted = _n == 1 & _N > 1
list if wanted
+---------------------+
| id price wanted |
|---------------------|
7. | 2 4567 1 |
15. | 1 7634 1 |
+---------------------+
* approach 2
drop wanted
bysort price id : keep if _n == 1 & _N > 1
list
+------------+
| id price |
|------------|
1. | 2 4567 |
2. | 1 7634 |
+------------+
Naturally if you want to duplicate data yet further (why?) then after approach 1
gen duplicated_price = price if wanted
gives you one copy of each of the duplicated values in a new variable. This is a slightly simpler equivalent of @Pearly Spencer's approach.
bysort price id : gen duplicated_price = price if _n == 1 & _N > 1
does it in one line.
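If instead you want the duplicated value recorded on every copy within each id, not just the first, a minimal sketch starting again from the full dataset (my reading of "save the duplicates of price for each id", not necessarily yours) is
bysort id price : gen dup_price = price if _N > 1     // filled in on every copy of a duplicated (id, price) pair
list id price dup_price if dup_price < ., sepby(id)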

How can I list negative values across my dataset?

I am using a dataset with about 100 variables and 1000 rows, similar to the one below:
. var1 var2 var3 var4
AL 10 11 12 13
AK -1 0 0 18
AZ 5 -5 -2 22
VA 15 16 0 0
How can I list the variables / observations that have a negative value?
For example, I would like to list that AK has negative var1 and AZ has negative var2 and var3.
Here's an example of how you can create a marker variable for each of your var variables:
clear
input str2 state var1 var2 var3 var4
AL 10 11 12 13
AK -1 0 0 18
AZ 5 -5 -2 22
VA 15 16 0 0
end
foreach var in var1 var2 var3 var4 {
generate tag_`var' = `var' < 0
}
list
+-------------------------------------------------------------------------------+
| state var1 var2 var3 var4 tag_var1 tag_var2 tag_var3 tag_var4 |
|-------------------------------------------------------------------------------|
1. | AL 10 11 12 13 0 0 0 0 |
2. | AK -1 0 0 18 1 0 0 0 |
3. | AZ 5 -5 -2 22 0 1 1 0 |
4. | VA 15 16 0 0 0 0 0 0 |
+-------------------------------------------------------------------------------+
You can then do the following:
list state var1 if tag_var1 == 1
+--------------+
| state var1 |
|--------------|
2. | AK -1 |
+--------------+
or
list state var* if tag_var1 == 1 | tag_var2 == 1 | tag_var3 == 1 | tag_var4 == 1
+-----------------------------------+
| state var1 var2 var3 var4 |
|-----------------------------------|
2. | AK -1 0 0 18 |
3. | AZ 5 -5 -2 22 |
+-----------------------------------+
If you do not need the extra flexibility of a marker variable you can simply do:
list state var1 if var1 < 0
EDIT:
Alternatively you could do the following:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list state var obsno value if value < 0, noobs sepby(state)
+------------------------------+
| state var obsno value |
|------------------------------|
| AK var1 2 -1 |
|------------------------------|
| AZ var2 3 -5 |
| AZ var3 3 -2 |
+------------------------------+
restore
There are two other techniques that can be mentioned. One is to calculate the minimum in each observation (row) and then list if and only if that minimum is negative. That way, you get any zeros, positives and missings too in the same observations.
The other is just to loop over the variables and list separately.
clear
input str2 state var1 var2 var3 var4
AL 10 11 12 13
AK -1 0 0 18
AZ 5 -5 -2 22
VA 15 16 0 0
end
egen min = rowmin(var*)
list if min < 0
+-----------------------------------------+
| state var1 var2 var3 var4 min |
|-----------------------------------------|
2. | AK -1 0 0 18 -1 |
3. | AZ 5 -5 -2 22 -5 |
+-----------------------------------------+
foreach v of var var* {
quietly count if `v' < 0
if r(N) list `v' if `v' < 0
}
+------+
| var1 |
|------|
2. | -1 |
+------+
+------+
| var2 |
|------|
3. | -5 |
+------+
+------+
| var3 |
|------|
3. | -2 |
+------+
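A slight variation on that loop (my addition, not part of the original answer) lists the state identifier alongside each negative value, which is closer to the wording of the question:
foreach v of var var* {
quietly count if `v' < 0
if r(N) list state `v' if `v' < 0
}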

Combine overlapping categorical variables

I am trying to "combine" two categorical variables in Stata (say var1 and var2) into a new (also categorical) variable (say res).
The example below illustrates what I am trying to achieve:
var1 var2 res
1 1 A
1 2 A
2 1 A
3 3 B
4 2 A
5 4 D
What this example does is to combine all categories of var1 and var2 that "overlap".
Here, the pair var1 == 1 and var2 == 1 initially forms a group (res == A). All other pairs containing var1 == 1 or var2 == 1 should belong to the same group (hence res == A in rows 2 and 3). Because in row 2 we have var2 == 2, any pair containing var2 == 2 should also belong to the same group. That's why in row 4 res == A.
Another way to look at this problem is using the following matrix:
    |  1   2   3   4
----+----------------
  1 |  1   1
  2 |  1
  3 |          1
  4 |      1
  5 |              1
Because the element [1,1] is not empty (or zero), all elements in row 1 and column 1 must belong to the same group. Because [1,2] is not empty, the same is true for row 1, column 2. And so on and so forth. It does not matter which row/column you decide to start from.
egen group alone doesn't cut it.
Any ideas?
Sounds like you want to further group var1 if the values of var2 are the same. If that's the case, then you can use a program I wrote called group_id that's available from SSC. To install it, type in Stata's Command window:
ssc install group_id
Here's an example of how you would use it:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2) str1 res
1 1 "A"
1 2 "A"
2 1 "A"
3 3 "B"
4 2 "A"
5 4 "D"
end
gen long wanted = var1
group_id wanted, matchby(var2)
list, sep(0)
and the results:
. list, sep(0)
+----------------------------+
| var1 var2 res wanted |
|----------------------------|
1. | 1 1 A 1 |
2. | 1 2 A 1 |
3. | 2 1 A 1 |
4. | 3 3 B 3 |
5. | 4 2 A 1 |
6. | 5 4 D 5 |
+----------------------------+
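If you would rather avoid installing anything, a rough sketch of the same idea using only built-in commands (label propagation over the connected components pictured in the question's matrix; offered as an illustration, not as what group_id does) is:
gen long comp = var1
local changed = 1
while `changed' {
    gen long previous = comp
    bysort var1 (comp) : replace comp = comp[1]   // same var1 -> same component label
    bysort var2 (comp) : replace comp = comp[1]   // same var2 -> same component label
    quietly count if comp != previous
    local changed = r(N)                          // stop once no label changes
    drop previous
}
On the example above, comp ends up matching wanted.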

Stata: How to count the number of 'active' cases in a group when new case is opened?

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to the NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less than the new case's OpenDate and (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}
This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.
Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
count if empid[`i'] == empid & /// a different count for each employee
opend[`i'] <= closed /// the date condition
in 1/`i' // no need to look at cases that have not yet occurred
replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive, so if you have a large data set it will take some time. The reason is that it loops over observations, checking conditions. See help stored results and help return for an explanation of r(N).
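As a minimal illustration of what is stored (my addition): after count, r(N) holds the number of observations just counted.
quietly count if empid == 1
display r(N)    // number of observations for which empid == 1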
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it for future questions. This will save other people's time and increase the probability of your getting an answer.