Save duplicates by id - stata

I have two variables in Stata, id and price:
id price
1 4321
1 7634
1 7974
1 7634
1 3244
2 5943
2 3294
2 5645
2 3564
2 4321
2 4567
2 4567
2 4567
2 4567
3 5652
3 9586
3 5844
3 8684
3 2456
4 7634
Usually I can use the duplicates command to get the duplicate observations of a variable.
However, how can I create a new variable that will save the duplicates
of price for each id?

There is no reason that I can see for duplicates to work with by:. duplicates whatever price id is the general recipe with your example, to examine duplicates jointly for two variables. Consider
clear
input id price
1 4321
1 7634
1 7974
1 7634
1 3244
2 5943
2 3294
2 5645
2 3564
2 4321
2 4567
2 4567
2 4567
2 4567
3 5652
3 9586
3 5844
3 8684
3 2456
4 7634
end
. duplicates example id price
Duplicates in terms of id price
  +------------------------------------+
  | group:   #   e.g. obs   id   price |
  |------------------------------------|
  |      1   2          2    1    7634 |
  |      2   4         11    2    4567 |
  +------------------------------------+
. duplicates tag id price, gen(tag)
Duplicates in terms of id price
. list id price if tag , sepby(id)
+------------+
| id price |
|------------|
2. | 1 7634 |
4. | 1 7634 |
|------------|
11. | 2 4567 |
12. | 2 4567 |
13. | 2 4567 |
14. | 2 4567 |
+------------+
Beyond that, I am not clear exactly what output or data result you wish to see.
EDIT In response to comment, here are two more direct approaches. duplicates is based on the idea that duplicates are mostly unwanted; you seem to have the opposite point of view, in which case duplicates is oblique to your wants.
* approach 1
bysort price id : gen wanted = _n == 1 & _N > 1
list if wanted
+---------------------+
| id price wanted |
|---------------------|
7. | 2 4567 1 |
15. | 1 7634 1 |
+---------------------+
* approach 2
drop wanted
bysort price id : keep if _n == 1 & _N > 1
list
+------------+
| id price |
|------------|
1. | 2 4567 |
2. | 1 7634 |
+------------+
Naturally if you want to duplicate data yet further (why?) then after approach 1
gen duplicated_price = price if wanted
gives you one copy of each of the duplicated values in a new variable. This is a slightly simpler equivalent of @Pearly Spencer's approach.
bysort price id : gen duplicated_price = price if _n == 1 & _N > 1
does it in one line.
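To make the link between duplicates tag and the by: idiom concrete, here is a minimal sketch on a shortened version of the example data. The variable name ncopies is hypothetical, chosen only for illustration. The sketch relies on duplicates tag recording the number of surplus copies of each observation, so it should equal the group size minus one.

```stata
clear
input id price
1 4321
1 7634
1 7634
2 4567
2 4567
2 4567
3 5652
end

* ncopies (hypothetical name): how many observations share this id and price
bysort id price : gen ncopies = _N

* duplicates tag counts surplus copies, so it should equal ncopies - 1
duplicates tag id price, gen(tag)
assert tag == ncopies - 1

list, sepby(id)
```

If the assert passes silently, the two approaches agree, and either variable can be used to pick out the duplicated prices within each id.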

Related

Create table for asclogit and nlogit

Suppose I have the following table:
id | car | sex | income
-------------------------------
1 | European | Male | 45000
2 | Japanese | Female | 48000
3 | American | Male | 53000
I would like to create the one below:
| id | car | choice | sex | income
--------------------------------------------
1.| 1 | European | 1 | Male | 45000
2.| 1 | American | 0 | Male | 45000
3.| 1 | Japanese | 0 | Male | 45000
| ----------------------------------------
4.| 2 | European | 0 | Female | 48000
5.| 2 | American | 0 | Female | 48000
6.| 2 | Japanese | 1 | Female | 48000
| ----------------------------------------
7.| 3 | European | 0 | Male | 53000
8.| 3 | American | 1 | Male | 53000
9.| 3 | Japanese | 0 | Male | 53000
I would like to fit an asclogit and according to Example 1 in Stata's Manual, this table format seems necessary. However, I have not found a way to create this easily.
You can use the cross command to generate all the possible combinations:
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
generate choice = 0
save old, replace
keep id
save new, replace
use old
rename id =_0
cross using new
replace choice = 1 if id_0 == id
replace sex = cond(id == 2, "Female", "Male")
replace income = cond(id == 1, 45000, cond(id == 2, 48000, 53000))
Note that the use of the cond() function here is equivalent to:
replace sex = "Male" if id == 1
replace sex = "Female" if id == 2
replace sex = "Male" if id == 3
replace income = 45000 if id == 1
replace income = 48000 if id == 2
replace income = 53000 if id == 3
The above code snippet produces the desired output:
drop id_0
order id car choice sex income
sort id car
list, sepby(id)
+------------------------------------------+
| id car choice sex income |
|------------------------------------------|
1. | 1 American 0 Male 45000 |
2. | 1 European 1 Male 45000 |
3. | 1 Japanese 0 Male 45000 |
|------------------------------------------|
4. | 2 American 0 Female 48000 |
5. | 2 European 0 Female 48000 |
6. | 2 Japanese 1 Female 48000 |
|------------------------------------------|
7. | 3 American 1 Male 53000 |
8. | 3 European 0 Male 53000 |
9. | 3 Japanese 0 Male 53000 |
+------------------------------------------+
For more information, type help cross and help cond() from Stata's command prompt.
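The cond() calls above hard-code each person's sex and income, which does not scale beyond a few ids. A possible variation, offered as a sketch rather than as the method of this answer, is to cross the list of alternatives with the person-level data, so that sex and income are carried along automatically. The names people and alternative are hypothetical. Because tempfile uses a local macro, the block should be run in one go from a do-file or the do-file editor.

```stata
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end

* save one row per person, including the car actually chosen
tempfile people
save `people'

* the data in memory now hold only the distinct alternatives
keep car
duplicates drop
rename car alternative

* pair every alternative with every person
cross using `people'

* the chosen alternative is the one matching the person's own car
generate byte choice = car == alternative
drop car
rename alternative car

order id car choice sex income
sort id car
list, sepby(id)
```

Here the person-level variables come from the using dataset, so no replace statements are needed however many ids there are.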
Please see dataex in Stata for how to produce data examples useful in web forums. (If necessary, install first using ssc install dataex.)
This could be an exercise in using fillin followed by filling in the missings.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
fillin id car
foreach v in sex income {
    bysort id (_fillin) : replace `v' = `v'[1]
}
list , sepby(id)
+-------------------------------------------+
| id car sex income _fillin |
|-------------------------------------------|
1. | 1 European Male 45000 0 |
2. | 1 American Male 45000 1 |
3. | 1 Japanese Male 45000 1 |
|-------------------------------------------|
4. | 2 Japanese Female 48000 0 |
5. | 2 European Female 48000 1 |
6. | 2 American Female 48000 1 |
|-------------------------------------------|
7. | 3 American Male 53000 0 |
8. | 3 European Male 53000 1 |
9. | 3 Japanese Male 53000 1 |
+-------------------------------------------+
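The fillin listing above does not yet contain the choice indicator asked for in the question. A minimal sketch of one way to add it, assuming that the observed (not filled-in) row for each id is the alternative actually chosen:

```stata
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end

fillin id car

* _fillin is 0 for original rows and 1 for added rows,
* so its negation marks the chosen alternative
gen byte choice = !_fillin

* copy the person-level values from the original row to the added rows
foreach v in sex income {
    bysort id (_fillin) : replace `v' = `v'[1]
}

drop _fillin
order id car choice sex income
sort id car
list, sepby(id)
```

The indicator is created before _fillin is dropped, since that variable is the only record of which rows were in the original data.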
A provisional solution using Pandas in Python is the following:
1) Read the dataset with:
import pandas as pd
df = pd.read_stata("mybase.dta")
2) Use the code of the accepted answer of this question.
3) Save the dataset:
df.to_stata("newbase.dta")
If one wants to use dummy variables, reshape also is an option.
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
tabulate car, gen(choice)
reshape long choice, i(id)
label define car 2 "European" 3 "Japanese" 1 "American"
drop car
rename _j car
label values car car
list, sepby(id)
+------------------------------------------+
| id car sex income choice |
|------------------------------------------|
1. | 1 American Male 45000 0 |
2. | 1 European Male 45000 1 |
3. | 1 Japanese Male 45000 0 |
|------------------------------------------|
4. | 2 American Female 48000 0 |
5. | 2 European Female 48000 0 |
6. | 2 Japanese Female 48000 1 |
|------------------------------------------|
7. | 3 American Male 53000 1 |
8. | 3 European Male 53000 0 |
9. | 3 Japanese Male 53000 0 |
+------------------------------------------+

Creating complete data in stata

I have the following purchasing data
clear
input id productid purchase
1 1 1
2 1 1
3 2 1
1 3 1
end
I want to add a row for every id-productid combo to create the following dataset
id productid purchase
1 1 1
2 1 1
3 1 0
1 2 0
2 2 0
3 2 1
1 3 1
2 3 0
3 3 0
end
I have tried a lot that has not worked. This is my latest attempt.
qui sum id, d
local obs = r(N)
expand = `obs'
levelsof productid, local(id)
local j = 1
foreach i of local id {
replace productid = `i' if `j' == id
local j = `j' + 1
}
The fillin command (see help fillin) is the tool for this task.
Starting with your sample data in memory:
fillin id productid
replace purchase = 0 if _fillin
drop _fillin
sort productid id
list, sepby(productid) abbreviate(12)
produces
+---------------------------+
| id productid purchase |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 1 |
3. | 3 1 0 |
|---------------------------|
4. | 1 2 0 |
5. | 2 2 0 |
6. | 3 2 1 |
|---------------------------|
7. | 1 3 1 |
8. | 2 3 0 |
9. | 3 3 0 |
+---------------------------+

Stacking variables for each unique ID

I am using Stata 13 to stack several variables into one variable using
stack stand1-stand10, into(all)
However, I need to do it for each unique id which is pasted parallel to all, something like:
bysort familyid: stack stand1-stand10,into(all) keep familyid
We can use a simpler analogue of your data example.
clear
set obs 3
gen familyid = _n
forval j = 1/3 {
    gen stand`j' = _n * `j'
}
list
+-------------------------------------+
| familyid stand1 stand2 stand3 |
|-------------------------------------|
1. | 1 1 2 3 |
2. | 2 2 4 6 |
3. | 3 3 6 9 |
+-------------------------------------+
save original
To stack with an identifier, just repeat the identifier variable name. For more than a few variables, it's easiest to prepare a call using a loop.
forval j = 1/3 {
    local call `call' familyid stand`j'
}
di "`call'"
familyid stand1 familyid stand2 familyid stand3
stack `call', into(familyid stand)
sort familyid _stack
list, sepby(familyid)
+---------------------------+
| _stack familyid stand |
|---------------------------|
1. | 1 1 1 |
2. | 2 1 2 |
3. | 3 1 3 |
|---------------------------|
4. | 1 2 2 |
5. | 2 2 4 |
6. | 3 2 6 |
|---------------------------|
7. | 1 3 3 |
8. | 2 3 6 |
9. | 3 3 9 |
+---------------------------+
That said, it's easier to use reshape long.
use original, clear
reshape long stand, i(familyid) j(which)
list, sepby(familyid)
+--------------------------+
| familyid which stand |
|--------------------------|
1. | 1 1 1 |
2. | 1 2 2 |
3. | 1 3 3 |
|--------------------------|
4. | 2 1 2 |
5. | 2 2 4 |
6. | 2 3 6 |
|--------------------------|
7. | 3 1 3 |
8. | 3 2 6 |
9. | 3 3 9 |
+--------------------------+

How to overwrite a duplicate observation

I conducted a phone survey and here is the prototype of my dataset:
var1 var2
6666 1
6666 2
7676 2
7676 1
8876 1
8876 2
89898 1
89898 2
9999 1
9999 2
5656 1
5656 2
2323 1
2323 2
9876 1
7654 1
var1 is the unique identifier for each case in my survey (in this case, phone numbers).
var2 is the outcome of the survey: 1 (successful), 2 (not successful).
I want to keep the observations for each var1 whose var2 == 1, while retaining the observations for each var1 whose var2 == 2 if there is no other case where var2 == 1.
I have tried
duplicates drop var1 if var2 == 2, force
but I am not getting the desired output.
The question is wrongly titled: you don't want to overwrite anything.
Your syntax doesn't work as you wish because it is not what you want. You are asking whether there are duplicates of var1 if var2 == 2 and that command pays no attention whatsoever to observations for which var2 == 1.
Your example includes no observations for which var2 == 2 but there is no corresponding observation with var2 == 1. I have added one such.
Here's one way of meeting your goal. I show in passing that the duplicates command you have does nothing for this example; nor would it be expected to do anything.
. clear
. input var1 var2
var1 var2
1. 6666 1
2. 6666 2
3. 7676 2
4. 7676 1
5. 8876 1
6. 8876 2
7. 89898 1
8. 89898 2
9. 9999 1
10. 9999 2
11. 5656 1
12. 5656 2
13. 2323 1
14. 2323 2
15. 9876 1
16. 7654 1
17. 42 2
18. end
. duplicates list var1 if var2 == 2
Duplicates in terms of var1
(0 observations are duplicates)
. bysort var1 (var2) : assert _N == 1 | _N == 2
. by var1 : drop if _n == 2 & var2[2] == 2
(7 observations deleted)
. list, sepby(var1)
+--------------+
| var1 var2 |
|--------------|
1. | 42 2 |
|--------------|
2. | 2323 1 |
|--------------|
3. | 5656 1 |
|--------------|
4. | 6666 1 |
|--------------|
5. | 7654 1 |
|--------------|
6. | 7676 1 |
|--------------|
7. | 8876 1 |
|--------------|
8. | 9876 1 |
|--------------|
9. | 9999 1 |
|--------------|
10. | 89898 1 |
+--------------+
Another way to do it would be
. bysort var1 (var2) : keep if _n == 1 & var2[2] == 2
but note that this also drops every var1 with only one observation, because var2[2] is then missing. In fact
. bysort var1 (var2): keep if _n == 1
keeps observations with var2 == 1 if there are any and otherwise will also keep singletons with var2 == 2.
The hidden assumptions seem to include at most two observations for each distinct var1. Note the use of assert for checking assumptions about the dataset.
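If the assumption of at most two observations per var1 does not hold, a sketch along the following lines may be closer to the stated goal. The toy data and the variable name anysuccess are hypothetical. It keeps every successful call, and keeps unsuccessful calls only for numbers that never had a success.

```stata
clear
input var1 var2
6666 1
6666 2
6666 2
42 2
42 2
9876 1
end

* anysuccess (hypothetical name): 1 if any call to this number succeeded
egen anysuccess = max(var2 == 1), by(var1)

* keep every success; keep failures only for numbers that never succeeded
keep if var2 == 1 | !anysuccess

* 6666 keeps its one success, 42 keeps both failures, 9876 keeps its success
assert _N == 4

list, sepby(var1)
```

Unlike keep if _n == 1, this retains all unsuccessful calls for a number with no success. Whether that is wanted depends on the survey; one more by: step would reduce them to a single observation.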

Stata: How to count the number of 'active' cases in a group when new case is opened?

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to the NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less than the new case's OpenDate and (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}
This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.
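The point about sum() and missings can be checked on a toy variable. This is only an illustration with made-up values, and running is a hypothetical variable name.

```stata
clear
input status
1
1
-1
.
1
end

* running (hypothetical name): cumulative sum, with the missing value counted as zero
gen running = sum(status)

* the missing value in observation 4 leaves the total unchanged
assert running[4] == 1
assert running[5] == 2

list
```

This is why the cases with no closing date in the example above do not disturb the count of open cases: their missing status adds nothing to the running total.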
Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
    count if empid[`i'] == empid & /// a different count for each employee
        opend[`i'] <= closed /// the date condition
        in 1/`i' // no need to look at cases that have not yet occurred
    replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive so if you have a large data set, it will take some time. The reason is it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
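As a small illustration of how r(N) behaves, here is a sketch using the auto dataset shipped with Stata, not the data from the question. The local macro name nforeign is hypothetical.

```stata
sysuse auto, clear

* count leaves its result behind in r(N)
count if foreign
display r(N)

* r(N) is overwritten by the next r-class command, so copy it if needed later
local nforeign = r(N)
count
display "foreign: `nforeign' of " r(N)
```

In the loop above, r(N) is used immediately after count for exactly this reason: each pass through the loop replaces the previous result.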
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it for future questions. This will save other people's time and increase the probability of your getting an answer.