Generate sum of all possible combinations of id - tuples

I have a dataset with the structure that looks something like this:
Group ID Value
1 A 10
1 B 15
1 C 20
2 D 10
2 E 25
Within each Group, I want to obtain the sum of all possible combinations of two or more IDs. For instance, within group 1, I can have the following combinations: AB, AC, BC, ABC. So, in total I have four possible combinations for group 1, of which I'd like to get the sum of the variable value.
I am using the formula for combinations of N elements in groups of size R to identify how many observations I need to add to the dataset to have enough observations.
For Group 1, the number of observations I need are:
3!/((3-2)!*2!)*2 = 6 for the two-IDs combinations
3!/(3-3)!*3!)*3 = 3 for the three-IDs combination.
So a total of 9 observations. Since I already have three, I can use the command:expand 6 if Group==1. For Group 1 I would get something like
Group ID Value
1 A 10
1 B 15
1 C 20
1 A 10
1 B 15
1 C 20
1 A 10
1 B 15
1 C 20
Now, I am stuck here on how to proceed to tell Stata to identify the combinations and create the summation. Ideally, I want to create two new variables, to identify the tuples and get the summation, so something that looks like:
Group ID Value Touple Sum
1 A 10 AB 25
1 B 15 AB 25
1 A 10 AC 30
1 C 20 AC 30
1 B 15 BC 35
1 C 20 BC 35
1 A 10 ABC 45
1 B 15 ABC 45
1 C 20 ABC 45
In this way, I could then just drop the duplicates in terms of Group and Tuples. Once I have the Tuples variable, getting the sum is straightforward, but getting the Tuples, I can't get my head around it.
Any advice on how to do this?

I tried doing this with nested loops and the tuples command.
First I create and save a tempfile to store results:
clear
tempfile group_results
save `group_results', replace emptyok
Then I input and save data, along with a local for the number of groups:
clear
input Group str1 ID Value
1 A 10
1 B 15
1 C 20
2 D 10
2 E 25
2 F 13 // added to test
2 G 2 // added to test
end
sum Group
local num_groups = r(max)
tempfile base
save `base', replace
Here's the core of the code. The outer loop here iterates over Groups. Then it makes a list of the IDs in that group, and uses the tuples command to make a list of the unique combinations of those IDs, with a minimum size of 2. The k loop iterates through the number of tuples and the m loop makes an indicator for tuple membership.
forvalues i = 1/`num_groups' {
display "Starting Group `i'"
use `base' if Group==`i', clear
* Make list of IDs to get unique combos of
forvalues j = 1/`=_N' {
local tuple_list`i' = "`tuple_list`i'' " + ID[`j']
}
* Get all unique combos in list using tuples command
tuples `tuple_list`i'', display min(2)
forvalues k = 1/`ntuples' {
display "Tuple `k': `tuple`k''"
local length = wordcount("`tuple`k''")
gen intuple=0
gen tuple`k'="`tuple`k''"
forvalues m = 1/`length' {
replace intuple=1 if ID==word("`tuple`k''",`m')
}
* Calculate sum of values in that tuple
gegen group_sum`k' = sum(Value) if intuple==1
drop intuple
list
}
* Reshape into desired format
reshape long tuple group_sum, i(Group ID Value) j(tuple_num)
drop if missing(group_sum)
sort tuple_num
list
append using `group_results'
save `group_results', replace
}
* Full results
use `group_results', clear
sort Group tuple_num
list
I hope this helps. The list commands will give you a busy results window but it shows what's all happening. Here's the output at the end of the i loop for Group 1:
+--------------------------------------------------+
| Group ID Value tuple_~m tuple group_~m |
|--------------------------------------------------|
1. | 1 C 20 1 B C 35 |
2. | 1 B 15 1 B C 35 |
3. | 1 A 10 2 A C 30 |
4. | 1 C 20 2 A C 30 |
5. | 1 A 10 3 A B 25 |
|--------------------------------------------------|
6. | 1 B 15 3 A B 25 |
7. | 1 C 20 4 A B C 45 |
8. | 1 A 10 4 A B C 45 |
9. | 1 B 15 4 A B C 45 |
+--------------------------------------------------+
This could be inefficient if your data is actually much larger!

Related

Splitting values in Stata and save them in a new variable

I have a numeric variable with values similar to the following system
1
2
12
21
2
I would like to split the values which have length > 1 and put the second half of the value
in another variable.
So the second variable would have the values:
.
.
2
1
.
Theoretically I would just use a simple replace statement, but I am looking for a code/loop, which would
recognize the double digit values and split them automatically and save them in the second variable. Because with time, there will be more observations added and I cannot do this task manually for >10k cases.
Here's one approach:
clear
input foo
1
2
12
21
2
end
generate foo1 = floor(foo/10)
generate foo2 = mod(foo, 10)
list
+-------------------+
| foo foo1 foo2 |
|-------------------|
1. | 1 0 1 |
2. | 2 0 2 |
3. | 12 1 2 |
4. | 21 2 1 |
5. | 2 0 2 |
+-------------------+
More on these functions here, here and here.
If zeros for the first part should be missing, then
replace foo1 = . if foo1 == 0
or (to do it in one)
generate foo1 = floor(foo/10) if foo >= 10
The code is also good for any arguments with three digits or more.

Hamming distance for grouped results

I'm working with a data set that contains 40 different participants, with each 30 observations.
As I am observing search behavior, I want to calculate the search distance for each subject per round (from 1 to30).
In order to compare my data with current literature, I need to use the Hamming distance to describe search distances.
The variable is called Inputs and is a string variable with binary inputs 0 or 1 with a length of 10. E.g:
Input Type 1 Subject 1 Round 1: 0000011111
Input Type 1 Subject 1 Round 2: 0000011110
Using the Levensthein distance, my approach was simple:
sort type_num Subject round_num
gen input_prev=Input[_n-1]
replace input_prev="0000000000" if round_num==1 //default starting position with 0000000000 to get search distance for first input in round 1
//Levensthein distance & clearing data (Levensthein instead of hamming distance)
ustrdist Input input_prev
rename strdist input_change
I am now struggling with getting the right Stata commands for the Hamming distance. Can someone help?
Does this help? As I understand it, Hamming distance is the count of characters (bits) that differ at corresponding positions of strings of equal length. So, given two variables and wanting comparisons within each observation, it is just a loop over the characters.
clear
set obs 10
set seed 2803
* sandbox
quietly forval j = 1/2 {
gen test`j' = ""
forval k = 1/10 {
replace test`j' = test`j' + strofreal(runiform() > (`j' * 0.3))
}
}
set obs 12
replace test1 = 10 * "1" in 11
replace test2 = test1 in 11
replace test1 = test1[11] in 12
replace test2 = 10 * "0" in 12
* calculation
gen wanted = 0
quietly forval k = 1/10 {
replace wanted = wanted + (substr(test1, `k', 1) != substr(test2, `k', 1))
}
list
+----------------------------------+
| test1 test2 wanted |
|----------------------------------|
1. | 1110001111 1001101000 7 |
2. | 1111011011 1101011111 2 |
3. | 1011001111 1110110111 5 |
4. | 0000111011 1011010100 8 |
5. | 1011011011 1111100110 6 |
|----------------------------------|
6. | 0011111100 0100011110 5 |
7. | 0011011011 0011111010 2 |
8. | 1010100011 1011000100 5 |
9. | 1110011011 1010010100 5 |
10. | 1001011111 0100111001 6 |
|----------------------------------|
11. | 1111111111 1111111111 0 |
12. | 1111111111 0000000000 10 |
+----------------------------------+

Retrieve row number according to value of another variable in index

This problem is very simple in R, but I can't seem to get it to work in Stata.
I want to use the square brackets index, but with an expression in it that involves another variable, i.e. for a variable with unique values cumul I want:
replace country = country[cumul==20] in 12
cumul == 20 corresponds to row number 638 in the dataset, so the above should replace in line 12 the country variable with the value of that same variable in line 638. The above expression is clearly not the right way to do it: it just replaces the country variable in line 12 with a missing value.
Stata's row indexing does not work in that way. What you can do, however, is a simple two-line solution:
levelsof country if cumul==20
replace country = "`r(levels)'" in 12
If you want to be sure that cumul==20 uniquely identifies just a single value of country, add:
assert `:word count `r(levels)''==1
between the two lines.
It's probably worth explaining why the construct in the question doesn't work as you wish, beyond "Stata is not R!".
Given a variable x: in a reference like x[1] the [1] is referred to as a subscript, despite nothing being written below the line. The subscript is the observation number, the number being always that in the dataset as currently held in memory.
Stata allows expressions within subscripts; they are evaluated observation by observation and the result is then used to look-up values in variables. Consider this sandbox:
clear
input float y
1
2
3
4
5
end
. gen foo = y[mod(_n, 2)]
(2 missing values generated)
. gen x = 3
. gen bar = y[y == x]
(4 missing values generated)
. list
+-------------------+
| y foo x bar |
|-------------------|
1. | 1 1 3 . |
2. | 2 . 3 . |
3. | 3 1 3 1 |
4. | 4 . 3 . |
5. | 5 1 3 . |
+-------------------+
mod(_n, 2) is the remainder on dividing the observation _n by 2: that is 1 for odd observation numbers and 0 for even numbers. Observation 0 is not in the dataset (Stata starts indexing at 1). It's not an error to refer to values in that observation, but the result is returned as missing (numeric missing here, and empty strings "" if the variable is string). Hence foo is x[1] or 1 for odd observation numbers and missing for even numbers.
True or false expressions are evaluated as 1 if true and 0 is false. Thus y == x is true only in observation 3, and so bar is the value of y[1] there and missing everywhere else. Stata doesn't have the special (and useful) twist in R that it is the subscripts for which a true or false expression is true that are used to select zero or more values.
There are ways of using subscripts to get special effects. This example shows one. (It's much easier to get the same kind of result in Mata.)
. gen random = runiform()
. sort random
. gen obs = _n
. sort y
. gen randomsorted = random[obs]
. l
+-----------------------------------------------+
| y foo x bar random obs random~d |
|-----------------------------------------------|
1. | 1 1 3 . .3488717 4 .0285569 |
2. | 2 . 3 . .2668857 3 .1366463 |
3. | 3 1 3 1 .1366463 2 .2668857 |
4. | 4 . 3 . .0285569 1 .3488717 |
5. | 5 1 3 . .8689333 5 .8689333 |
+-----------------------------------------------+
This answer doesn't cover matrices in Stata or Mata.

Stata: How to count the number of 'active' cases in a group when new case is opened?

I'm relatively new to Stata and am trying to count the number of active cases an employee has open over time in my dataset (see link below for example). I tried writing a loop using forvalues based on an example I found online, but keep getting
invalid syntax
For each EmpID I want to count the number of cases that employee had open when a new case was added to the queue. So if a case is added with an OpenDate of 03/15/2015 and the EmpID has two other cases open at the time, the code would assign a value of 2 to NumActiveWhenOpened field. A case is considered active if (1) its OpenDate is less then the new case's OpenDate & (2) its CloseDate is greater than the new case's OpenDate.
The link below provides an example. I'm trying to write a loop that creates the NumActiveWhenOpened column. Any help would be greatly appreciated. Thanks!
http://i.stack.imgur.com/z4iyR.jpg
EDIT
Here is the code that is not working. I'm sure there are several things wrong with it and I'm not sure how to store the count in the [NumActiveWhenOpen] field.
by EmpID: generate CaseNum = _n
egen group = group(EmpID)
su group, meanonly
gen NumActiveWhenOpen = 0
forvalues i = 1/ 'r(max)' {
forvalues x = 1/CaseNum if group == `i'{
count if OpenDate[_n] > OpenDate[_n-x] & CloseDate[_n-x] > OpenDate[_n]
}
}
This sounds like a problem discussed in http://www.stata-journal.com/article.html?article=dm0068 but let's try to be self-contained. I am not sure that I understand the definitions, but this may help.
I'll steal part of Roberto Ferrer's sandbox.
clear
set more off
input ///
caseid str15(open close) empid
1 "1/1/2010" "3/1/2010" 1
2 "2/5/2010" "" 1
3 "2/15/2010" "4/7/2010" 1
4 "3/5/2010" "" 1
5 "3/15/2010" "6/15/2010" 1
6 "3/24/2010" "3/24/2010" 1
1 "1/1/2010" "3/1/2010" 2
2 "2/5/2010" "" 2
3 "2/15/2010" "4/7/2010" 2
4 "3/5/2010" "" 2
5 "3/15/2010" "6/15/2010" 2
end
gen d1 = date(open, "MDY")
gen d2 = date(close, "MDY")
format %td d1 d2
drop open close
reshape long d, i(empid caseid) j(status)
replace status = -1 if status == 2
replace status = . if missing(d)
bysort empid (d) : gen nopen = sum(status)
bysort empid d : replace nopen = nopen[_N]
l
The idea is to reshape so that each pair of dates becomes two observations. Then if we code each opening by 1 and each closing by -1 the total number of active cases is their cumulative sum. That's all. Here are the results:
. l, sepby(empid)
+---------------------------------------------+
| empid caseid status d nopen |
|---------------------------------------------|
1. | 1 1 1 01jan2010 1 |
2. | 1 2 1 05feb2010 2 |
3. | 1 3 1 15feb2010 3 |
4. | 1 1 -1 01mar2010 2 |
5. | 1 4 1 05mar2010 3 |
6. | 1 5 1 15mar2010 4 |
7. | 1 6 1 24mar2010 4 |
8. | 1 6 -1 24mar2010 4 |
9. | 1 3 -1 07apr2010 3 |
10. | 1 5 -1 15jun2010 2 |
11. | 1 2 . . 2 |
12. | 1 4 . . 2 |
|---------------------------------------------|
13. | 2 1 1 01jan2010 1 |
14. | 2 2 1 05feb2010 2 |
15. | 2 3 1 15feb2010 3 |
16. | 2 1 -1 01mar2010 2 |
17. | 2 4 1 05mar2010 3 |
18. | 2 5 1 15mar2010 4 |
19. | 2 3 -1 07apr2010 3 |
20. | 2 5 -1 15jun2010 2 |
21. | 2 4 . . 2 |
22. | 2 2 . . 2 |
+---------------------------------------------+
The bottom line is no loops needed, but by: helps mightily. A detail useful here is that the cumulative sum function sum() ignores missings.
Try something along the lines of
clear
set more off
*----- example data -----
input ///
caseid str15(open close) empid numact
1 "1/1/2010" "3/1/2010" 1 0
2 "2/5/2010" "" 1 1
3 "2/15/2010" "4/7/2010" 1 2
4 "3/5/2010" "" 1 2
5 "3/15/2010" "6/15/2010" 1 3
6 "3/24/2010" "3/24/2010" 1 .
1 "1/1/2010" "3/1/2010" 2 0
2 "2/5/2010" "" 2 1
3 "2/15/2010" "4/7/2010" 2 2
4 "3/5/2010" "" 2 2
5 "3/15/2010" "6/15/2010" 2 3
end
gen opend = date(open, "MDY")
gen closed = date(close, "MDY")
format %td opend closed
drop open close
order empid
list, sepby(empid)
*----- what you want -----
gen numact2 = .
sort empid caseid
forvalues i = 1/`=_N' {
count if empid[`i'] == empid & /// a different count for each employee
opend[`i'] <= closed /// the date condition
in 1/`i' // no need to look at cases that have not yet occurred
replace numact2 = r(N) - 1 in `i'
}
list, sepby(empid)
This is resource intensive so if you have a large data set, it will take some time. The reason is it loops over observations checking conditions. See help stored results and help return for an explanation of r(N).
A good read is
Stata tip 51: Events in intervals, The Stata Journal, by Nicholas J. Cox.
Note how I provided an example data set within the code (see help input). That is how I recommend you do it for future questions. This will save other people's time and increase the probabilities of you getting an answer.

Count observations within dynamic range

Consider the following example:
input group day month year number treatment NUM
1 1 2 2000 1 1 2
1 1 6 2000 2 0 .
1 1 9 2000 3 0 .
1 1 5 2001 4 0 .
1 1 1 2010 5 1 1
1 1 5 2010 6 0 .
2 1 1 2001 1 1 0
2 1 3 2002 2 1 0
end
gen date = mdy(month,day,year)
format date %td
drop day month year
For each group, I have a varying number of observations. Each observations refers to an event that is specified with a date. Variable number is the numbering within each group.
Now, I want to count the number of observations that occur one year starting from the date of each treatment observation (excluding itself) within this group. This means, I want to create the variable NUM that I have already put into my example above. I do not care about the number of observations with treatment = 0.
EDIT Begin: The following information was found to be missing but necessary to tackle this problem: The treatment variable will have a value of 1 if there is no observation within the same group in the last year. Thus it is also not possible that the variable NUM will have to consider observations with treatment = 1. In principal, it is possible that there are two observations within a group that have identical dates. EDIT End
I have looked into Stata tip 51: Events in intervals. It seems to work out however my dataset is huge (> 1 mio observations) such that it is really really inefficient - especially because I do not care about all treatment = 0 observations.
I was wondering if there is any alternative. My approach was to look for the observation with the latest date within each group that is still in the range of 1 year (and maybe store it in variable latestDate). Then I would simply subtract the value in variable number of the observation found from the value in count of the treatment = 0 variable.
Note: My "inefficient" code looks as follows
gsort -treatment
gen treatment_id = _n
replace treatment_id = . if treatment==0
gen count=.
sum treatment_id, meanonly
qui forval i = 1/`r(max)'{
count if inrange(date-date[`i'],1,365) & group == group[`i']
replace count = r(N) in `i'
}
sort group date
I am assuming that treatment can't occur within 1 year of the previous treatment (in the group). This is true in your example data, but may not be true in general. But, assuming that it is the case, then this should work. I'm using carryforward which is on SSC (ssc install carryforward). Like your latestDate thought, I determine one year after the most recent treatment and count the number of observations in that window.
sort group date
gen yrafter = (date + 365) if treatment == 1
by group: carryforward yrafter, replace
format yrafter %td
gen in_window = date <= yrafter & treatment == 0
egen answer = sum(in_window), by(group yrafter)
replace answer = . if treatment == 0
I can't promise this will be faster than a loop but I suspect that it will be.
The question is not completely clear.
Consider the following data with two different results, num2 and num3:
+-----------------------------------------+
| date2 group treat num2 num3 |
|-----------------------------------------|
| 01feb2000 1 1 3 2 |
| 01jun2000 1 0 . . |
| 01sep2000 1 0 . . |
| 01nov2000 1 1 0 0 |
| 01may2002 1 0 . . |
| 01jan2010 1 1 1 1 |
| 01may2010 1 0 . . |
|-----------------------------------------|
| 01jan2001 2 1 0 0 |
| 01mar2002 2 1 0 0 |
+-----------------------------------------+
The variable num2 is computed assuming you are interested in counting all observations that are within a one-year period after a treated observation (treat == 1), be those observations equal to 0 or 1 for treat. For example, after 01feb2000, there are three observations that comply with the time span condition; two have treat==0 and one has treat == 1, and they are all counted.
The variable num3 is also counting observations that are within a one-year period after a treated observation, but only the cases for which treat == 0.
num2 is computed with code in the spirit of the article you have cited. The use of in makes the run more efficient and there is no gsort (as in your code), which is quite slow. I have assumed that in each group there are no repeated dates:
clear
set more off
input ///
group str15 date count treat num
1 01.02.2000 1 1 2
1 01.06.2000 2 0 .
1 01.09.2000 3 0 .
1 01.11.2000 3 1 .
1 01.05.2002 4 0 .
1 01.01.2010 5 1 1
1 01.05.2010 6 0 .
2 01.01.2001 1 1 0
2 01.03.2002 2 1 0
end
list
gen date2 = date(date,"DMY")
format date2 %td
drop date count num
order date
list, sepby(group)
*----- what you want -----
gen num2 = .
isid group date, sort
forvalues j = 1/`=_N' {
count in `j'/L if inrange(date2 - date2[`j'], 1, 365) & group == group[`j']
replace num2 = r(N) in `j'
}
replace num2 = . if !treat
list, sepby(group)
num3 is computed with code similar in spirit (and results) as that posted by #jfeigenbaum:
<snip>
*----- what you want -----
isid group date, sort
by group: gen indicat = sum(treat)
sort group indicat, stable
by group indicat: egen num3 = total(inrange(date2 - date2[1], 1, 365))
replace num3 = . if !treat
list, sepby(group)
Even more than two interpretations are possible for your problem, but I'll leave it at that.
(Note that I have changed your example data to include cases that probably make the problem more realistic.)