Counting the number of prior observations excluding those belonging to a certain group - Stata

I have loan-level data with the following structure and want to create the variable Number:
Loan Borrower Lender Date Crop Country Number
1 A X 01/01/20 Coffee USA 0
2 B X 01/02/20 Coffee USA 0
3 C X 01/03/20 Coffee USA 0
4 D X 01/04/20 Coffee USA 0
5 E X 01/05/20 Banana USA 4
6 F X 01/06/20 Banana USA 4
7 G X 01/07/20 Coffee USA 2
8 H X 01/08/20 Orange USA 7
9 I X 01/09/20 Coffee USA 3
. . . . . . .
. . . . . . .
I want to number my loans based on this set of rules:
How many loans has the lender issued up to this point (including this loan)?
This number should only include loans in the same country as my loan.
This number should exclude all loans given out in the same crop.
Hence I am left with a number for each observation: the count of loans given out by the lender in the same country as that loan, excluding those that share its crop.
So far I tried running:
bysort Lender Country (Date): gen var = _n
The problem with this is that it does not subtract the observations that occur in the same crop.

* Example generated by -dataex-.
clear
input byte Loan str8 Borrower str6 Lender float Date str6 Crop str7 Country byte Number
1 "A" "X" 21915 "Coffee" "USA" 0
2 "B" "X" 21916 "Coffee" "USA" 0
3 "C" "X" 21917 "Coffee" "USA" 0
4 "D" "X" 21918 "Coffee" "USA" 0
5 "E" "X" 21919 "Banana" "USA" 4
6 "F" "X" 21920 "Banana" "USA" 4
7 "G" "X" 21921 "Coffee" "USA" 2
8 "H" "X" 21922 "Orange" "USA" 7
9 "I" "X" 21923 "Coffee" "USA" 3
end
format %td Date
bysort Crop (Date) : gen this = _n
bysort Crop Date (this): replace this = this[_N]
sort Loan
gen wanted1 = _n - this
bysort Country (Date) : replace this = _n
bysort Country Date (this): replace this = this[_N]
sort Loan
gen wanted2 = _n - this
list
+---------------------------------------------------------------------------------------------+
| Loan Borrower Lender Date Crop Country Number this wanted1 wanted2 |
|---------------------------------------------------------------------------------------------|
1. | 1 A X 01jan2020 Coffee USA 0 1 0 0 |
2. | 2 B X 02jan2020 Coffee USA 0 2 0 0 |
3. | 3 C X 03jan2020 Coffee USA 0 3 0 0 |
4. | 4 D X 04jan2020 Coffee USA 0 4 0 0 |
5. | 5 E X 05jan2020 Banana USA 4 5 4 0 |
|---------------------------------------------------------------------------------------------|
6. | 6 F X 06jan2020 Banana USA 4 6 4 0 |
7. | 7 G X 07jan2020 Coffee USA 2 7 2 0 |
8. | 8 H X 08jan2020 Orange USA 7 8 7 0 |
9. | 9 I X 09jan2020 Coffee USA 3 9 3 0 |
+---------------------------------------------------------------------------------------------+
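The counting rule can be cross-checked outside Stata. Here is the same definition written out in plain Python (only a sketch of the rule, not part of the Stata workflow; the list literal just transcribes the example data):

```python
# Loan-level data from the question: (loan, lender, country, crop), in date order
loans = [
    (1, "X", "USA", "Coffee"), (2, "X", "USA", "Coffee"),
    (3, "X", "USA", "Coffee"), (4, "X", "USA", "Coffee"),
    (5, "X", "USA", "Banana"), (6, "X", "USA", "Banana"),
    (7, "X", "USA", "Coffee"), (8, "X", "USA", "Orange"),
    (9, "X", "USA", "Coffee"),
]

# Number = loans by the same lender in the same country up to and including
# this one, excluding loans in the same crop (the loan itself drops out too,
# since it trivially shares its own crop)
number = [
    sum(1 for _, l2, c2, k2 in loans[: i + 1]
        if l2 == lender and c2 == country and k2 != crop)
    for i, (_, lender, country, crop) in enumerate(loans)
]
# number -> [0, 0, 0, 0, 4, 4, 2, 7, 3], matching the Number column
```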


How to recode separate variables from a multiple response survey question into one variable

I am trying to recode a variable that indicates total number of responses to a multiple response survey question. Question 4 has options 1, 2, 3, 4, 5, 6, and participants may choose one or more options when submitting a response. The data is currently coded as binary outputs for each option: var Q4___1 = yes or no (1/0), var Q4___2 = yes or no (1/0), and so forth.
This is the tabstat of all yes (1) responses to the 6 Q4___* variables
Variable | Sum
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
I would like to create a new variable that encapsulates these values.
Can someone help me figure out how to create this variable, and if coding a variable in this manner for a multiple option survey question is valid?
When I used the replace command, the total number of responses did not add up, as shown below:
gen q4=.
replace q4 =1 if q4___1 == 1
replace q4 =2 if q4___2 == 1
replace q4 =3 if q4___3 == 1
replace q4 =4 if q4___4 == 1
replace q4 =5 if q4___5 == 1
replace q4 =6 if q4___6 == 1
label values q4 primarysource
q4 | Freq. Percent Cum.
------------+-----------------------------------
1 | 46 48.94 48.94
2 | 31 32.98 81.91
3 | 6 6.38 88.30
4 | 1 1.06 89.36
5 | 3 3.19 92.55
6 | 7 7.45 100.00
------------+-----------------------------------
Total | 94 100.00
UPDATE
To clarify: I am trying to create a new variable that captures the column sum for each option, not the rowtotal across all options. I know that 63 participants responded yes to question 4 a) and 33 to question 4 b), so I want my new variable to reflect that.
This is what I want my new variable's values to look like.
q4
-------------+----------
q4___1 | 63
q4___2 | 33
q4___3 | 7
q4___4 | 2
q4___5 | 3
q4___6 | 7
------------------------
total = 115
The fallacy here is ignoring the possibility of multiple 1s as answers across the various Q4???? variables. For example, if someone answers 1 1 1 1 1 1 to all questions, they appear in your final variable only in respect of their answer to the 6th question. Otherwise put, your code overwrites, and so ignores, all positive answers before the last positive answer.
What is likely to be more useful are
(1) the total across all 6 questions which is just
egen Q4_total = rowtotal(Q4????)
where the 4 instances of ? mean that by eye I count 3 underscores and 1 numeral.
(2) a concatenation of responses that is just
egen Q4_concat = concat(Q4????)
(3) a variable that is a concatenation of questions with positive responses, so 246 if those questions were answered 1 and the others were answered 0.
gen Q4_pos = ""
forval j = 1/6 {
replace Q4_pos = Q4_pos + "`j'" if Q4___`j' == 1
}
EDIT
Here is a test script giving concrete examples.
clear
set obs 6
forval j = 1/6 {
gen Q`j' = _n <= `j'
}
list
egen rowtotal = rowtotal(Q?)
su rowtotal, meanonly
di r(sum)
* install from tab_chi on SSC
tabm Q?
Results:
. list
+-----------------------------+
| Q1 Q2 Q3 Q4 Q5 Q6 |
|-----------------------------|
1. | 1 1 1 1 1 1 |
2. | 0 1 1 1 1 1 |
3. | 0 0 1 1 1 1 |
4. | 0 0 0 1 1 1 |
5. | 0 0 0 0 1 1 |
|-----------------------------|
6. | 0 0 0 0 0 1 |
+-----------------------------+
. egen rowtotal = rowtotal(Q?)
. su rowtotal, meanonly
. di r(sum)
21
. tabm Q?
| values
variable | 0 1 | Total
-----------+----------------------+----------
Q1 | 5 1 | 6
Q2 | 4 2 | 6
Q3 | 3 3 | 6
Q4 | 2 4 | 6
Q5 | 1 5 | 6
Q6 | 0 6 | 6
-----------+----------------------+----------
Total | 15 21 | 36
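For readers who want to verify the bookkeeping independently, the same three quantities (rowtotal, the per-question column sums reported by tabm, and the Q4_pos-style concatenation) can be reproduced in plain Python. This is only a cross-check of the logic, not a Stata substitute:

```python
# The six 0/1 answer columns from the test script: row n has Q_j = 1 when n <= j
rows = [[1 if n <= j else 0 for j in range(1, 7)] for n in range(1, 7)]

rowtotal = [sum(r) for r in rows]                      # per-respondent total
colsum = [sum(r[j] for r in rows) for j in range(6)]   # per-question yes count
pos = ["".join(str(j + 1) for j, v in enumerate(r) if v) for r in rows]

# sum(rowtotal) -> 21, colsum -> [1, 2, 3, 4, 5, 6], as in the tabm output
```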

Populating new variable using vlookup with multiple criteria in another variable

1) A new variable should be created for each unique observation listed in variable sku, which contains repeated values.
2) These newly created variables should be assigned the value of the product's own price at the store/week level, as long as an observation's sku value is in the same subcategory (subc) as the variable itself. For example, in eta2,3, observations in lines 3, 4, and 5 have the same value because they all belong to the same subcategory as sku #3. [eta2,3 indicates sku 3, subc 2.]
3) x indicates that this is the original value for the product/subcategory that is currently being replicated.
4) If an observation doesn't belong to the same subcategory, it should reflect "0".
Orange is the given data. In green are the values from the steps 1, 2, and 3. White cells are step 4.
I am unable to offer a solution of my own, as searching for a way to generate a variable from existing observations hasn't given me results. I suspect the solution involves a combination of the forvalues, foreach, and levelsof commands.
clear
input units price sku week store subc
3 4.3 1 1 1 1
2 3 2 1 1 1
1 2.5 3 1 1 2
4 12 5 1 1 2
5 12 6 1 1 3
35 4.3 1 1 2 1
23 3 2 1 2 1
12 2.5 3 1 2 2
35 12 5 1 2 2
35 12 6 1 2 3
3 20 1 2 1 1
2 30 2 2 1 1
4 40 3 2 2 2
1 50 4 2 2 2
9 10 5 2 2 2
2 90 6 2 2 3
end
UPDATE
Based on Nick Cox's feedback, this is the final code that gives the result I have been looking for:
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
egen joint = group(subc sku), label
bysort store week : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
tostring subc sku, replace
gen new = subc + "_"+sku
su joint, meanonly
forval j = 1/`r(max)' {
local J = new[`j']
gen eta`J' = .
}
sort subc week store sku
egen joint1 = group(subc week store), label
gen long id = _n
su joint1, meanonly
quietly forval i = 1/`r(max)' {
su id if joint1 == `i', meanonly
local jmin = r(min)
local jmax = r(max)
forval j = `jmin'/`jmax' {
local subc = subc[`j']
local sku = sku[`j']
replace eta`subc'_`sku' = price[`j'] in `jmin'/`jmax'
replace eta`subc'_`sku' = 0 in `j'/`j'
}
}
I worry on your behalf that in a dataset of any size, what you ask for would mean many, many extra variables. I wonder on your behalf whether you need all of them anyway for whatever you want to do with them.
That aside, this seems to be what you want. Naturally your column headers in your spreadsheet view aren't legal variable names. Disclosure: despite being the original author of levelsof I wouldn't prefer its use here.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
end
sort subc sku
* subc identifiers guaranteed to be integers 1 up
egen subc_id = group(subc), label
* observation numbers in a variable
gen long id = _n
* how many subc? loop over the range
su subc_id, meanonly
forval i = 1/`r(max)' {
* which subc is this one? look it up using -summarize-
* assuming that subc is numeric!
su subc if subc_id == `i', meanonly
local I = r(min)
* which observation numbers for this subc?
* given the prior sort, they are all contiguous
su id if subc_id == `i', meanonly
* for each observation in the subc, find out the sku and copy its price
* to all observations in that subc
forval j = `r(min)'/`r(max)' {
local J = sku[`j']
gen eta_`I'_`J' = cond(subc_id == `i', price[`j'], 0)
}
}
list subc eta*, sepby(subc)
+------------------------------------------------------------------+
| subc eta_1_1 eta_1_2 eta_2_3 eta_2_4 eta_2_5 eta_3_6 |
|------------------------------------------------------------------|
1. | 1 4.3 3 0 0 0 0 |
2. | 1 4.3 3 0 0 0 0 |
|------------------------------------------------------------------|
3. | 2 0 0 2.5 1 12 0 |
4. | 2 0 0 2.5 1 12 0 |
5. | 2 0 0 2.5 1 12 0 |
|------------------------------------------------------------------|
6. | 3 0 0 0 0 0 12 |
+------------------------------------------------------------------+
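The construction above boils down to a simple rule: each eta_<subc>_<sku> variable holds that sku's price on observations in the same subc, and 0 elsewhere. As a language-independent sketch of that rule (plain Python, transcribing the six example rows):

```python
# Rows from the example: (price, sku, subc)
rows = [(4.3, 1, 1), (3, 2, 1), (2.5, 3, 2), (1, 4, 2), (12, 5, 2), (12, 6, 3)]

# eta_<subc>_<sku>: that sku's price on rows in the same subc, else 0
eta = {
    (subc, sku): [price if s == subc else 0 for _, _, s in rows]
    for price, sku, subc in rows
}
# e.g. eta[(2, 5)] -> [0, 0, 12, 12, 12, 0], matching the eta_2_5 column
```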
Notes:
N1. In your example, subc is numbered 1, 2, etc. My extra variable subc_id ensures that to be true even if in your real data the identifiers are not so clean.
N2. The expression
cond(subc_id == `i', price[`j'], 0)
could also be
(subc_id == `i') * price[`j']
N3. It seems possible that a different data structure would be much more efficient.
EDIT: Here is code and results for another data structure.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
end
sort subc sku
egen subc_id = group(subc), label
bysort subc : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
forval j = 1/`jmax' {
gen eta`j' = .
gen which`j' = .
}
gen long id = _n
su subc_id, meanonly
quietly forval i = 1/`r(max)' {
su id if subc_id == `i', meanonly
local jmin = r(min)
local jmax = r(max)
local k = 1
forval j = `jmin'/`jmax' {
replace which`k' = sku[`j'] in `jmin'/`jmax'
replace eta`k' = price[`j'] in `jmin'/`jmax'
local ++k
}
}
list subc sku *1 *2 *3 , sepby(subc)
+------------------------------------------------------------+
| subc sku eta1 which1 eta2 which2 eta3 which3 |
|------------------------------------------------------------|
1. | 1 1 4.3 1 3 2 . . |
2. | 1 2 4.3 1 3 2 . . |
|------------------------------------------------------------|
3. | 2 3 2.5 3 1 4 12 5 |
4. | 2 4 2.5 3 1 4 12 5 |
5. | 2 5 2.5 3 1 4 12 5 |
|------------------------------------------------------------|
6. | 3 6 12 6 . . . . |
+------------------------------------------------------------+
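The second data structure can likewise be sketched outside Stata: group the rows by subc, then give every row the k-th (price, sku) pair of its own group, padding shorter groups with missing values. A plain Python cross-check of that reshaping logic only:

```python
from collections import defaultdict

# (price, sku, subc) rows, already sorted by subc as in the Stata code
rows = [(4.3, 1, 1), (3, 2, 1), (2.5, 3, 2), (1, 4, 2), (12, 5, 2), (12, 6, 3)]

bygroup = defaultdict(list)
for price, sku, subc in rows:
    bygroup[subc].append((price, sku))

# eta_k / which_k: the k-th (price, sku) within the row's own subc group,
# None standing in for Stata's missing value when the group is shorter
width = max(len(g) for g in bygroup.values())
table = [
    [bygroup[subc][k] if k < len(bygroup[subc]) else None for k in range(width)]
    for _, _, subc in rows
]
```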
I am adding another answer that tackles combinations of subc and week. Previous discussion establishes that what you are trying to do would add an extra variable for every observation. This can't be a good idea! At best, you might just have many new variables, mostly zeros. At worst, you will run into Stata's limits.
Hence I won't support your endeavour to go further down the same road, but show how the second data structure I discuss in my previous answer can be produced. Indeed, you haven't indicated (a) why you want all these variables, which are just the existing data redistributed; (b) what your strategy is for dealing with them; (c) why rangestat (SSC) or some other program could not remove the need to create them in the first place.
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
sort subc week sku
egen joint = group(subc week), label
bysort joint : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
forval j = 1/`jmax' {
gen eta`j' = .
gen which`j' = .
}
gen long id = _n
su joint, meanonly
quietly forval i = 1/`r(max)' {
su id if joint == `i', meanonly
local jmin = r(min)
local jmax = r(max)
local k = 1
forval j = `jmin'/`jmax' {
replace which`k' = sku[`j'] in `jmin'/`jmax'
replace eta`k' = price[`j'] in `jmin'/`jmax'
local ++k
}
}
list subc week sku *1 *2 *3 , sepby(subc week)
+-------------------------------------------------------------------+
| subc week sku eta1 which1 eta2 which2 eta3 which3 |
|-------------------------------------------------------------------|
1. | 1 1 1 4.3 1 3 2 . . |
2. | 1 1 2 4.3 1 3 2 . . |
|-------------------------------------------------------------------|
3. | 1 2 1 5.3 1 4 2 . . |
4. | 1 2 2 5.3 1 4 2 . . |
|-------------------------------------------------------------------|
5. | 2 1 3 2.5 3 1 4 12 5 |
6. | 2 1 4 2.5 3 1 4 12 5 |
7. | 2 1 5 2.5 3 1 4 12 5 |
|-------------------------------------------------------------------|
8. | 2 2 3 3.5 3 2 4 13 5 |
9. | 2 2 4 3.5 3 2 4 13 5 |
10. | 2 2 5 3.5 3 2 4 13 5 |
|-------------------------------------------------------------------|
11. | 3 1 6 12 6 . . . . |
|-------------------------------------------------------------------|
12. | 3 2 6 13 6 . . . . |
+-------------------------------------------------------------------+
clear
input units price sku week store subc
35 4.3 1 1 1 1
23 3 2 1 1 1
12 2.5 3 1 1 2
10 1 4 1 1 2
35 12 5 1 1 2
35 12 6 1 1 3
35 5.3 1 2 1 1
23 4 2 2 1 1
12 3.5 3 2 1 2
10 2 4 2 1 2
35 13 5 2 1 2
35 13 6 2 1 3
end
egen joint = group(subc sku), label
bysort store week : gen freq = _N
su freq, meanonly
local jmax = r(max)
drop freq
tostring subc sku, replace
gen new = subc + "_"+sku
su joint, meanonly
forval j = 1/`r(max)' {
local J = new[`j']
gen eta`J' = .
}
sort subc week store sku
egen joint1 = group(subc week store), label
gen long id = _n
su joint1, meanonly
quietly forval i = 1/`r(max)' {
su id if joint1 == `i', meanonly
local jmin = r(min)
local jmax = r(max)
forval j = `jmin'/`jmax' {
local subc = subc[`j']
local sku = sku[`j']
replace eta`subc'_`sku' = price[`j'] in `jmin'/`jmax'
replace eta`subc'_`sku' = 0 in `j'/`j'
}
}
list subc sku store week eta*, sepby(subc)
+---------------------------------------------------------------------------------+
| store week subc sku eta1_1 eta1_2 eta2_3 eta2_4 eta2_5 eta3_6 |
|---------------------------------------------------------------------------------|
1. | 1 1 1 2 4.3 0 . . . . |
2. | 1 1 1 1 0 3 . . . . |
|---------------------------------------------------------------------------------|
3. | 1 1 2 4 . . 2.5 0 12 . |
4. | 1 1 2 3 . . 0 1 12 . |
5. | 1 1 2 5 . . 2.5 1 0 . |
|---------------------------------------------------------------------------------|
6. | 1 1 3 6 . . . . . 0 |
|---------------------------------------------------------------------------------|
7. | 1 2 1 2 5.3 0 . . . . |
8. | 1 2 1 1 0 4 . . . . |
|---------------------------------------------------------------------------------|
9. | 1 2 2 3 . . 0 2 13 . |
10. | 1 2 2 5 . . 3.5 2 0 . |
11. | 1 2 2 4 . . 3.5 0 13 . |
|---------------------------------------------------------------------------------|
12. | 1 2 3 6 . . . . . 0 |
+---------------------------------------------------------------------------------+

Save list of distinct values of a variable in another variable

I have data at the country-year-z level, where z is a categorical variable that can take (say) 10 different values for each country-year. Each combination of country-year-z is unique in the dataset.
I would like to obtain a dataset at the country-year level, with a new (string) variable containing all distinct values of z.
For instance let's say I have the following data:
country year z
A 2000 1
A 2001 1
A 2001 2
A 2001 4
A 2002 2
A 2002 5
B 2001 7
B 2001 8
B 2002 4
B 2002 5
B 2002 9
B 2003 3
B 2003 4
B 2005 1
I would like to get the following data:
country year z_distinct
A 2000 1
A 2001 1 2
A 2002 2 5
B 2001 7 8
B 2002 4 5 9
B 2003 3 4
B 2003 4
Here's another way to do it, perhaps more direct. If z is already a string variable the string() calls should both be omitted.
clear
input str1 country year z
A 2000 1
A 2001 1
A 2001 2
A 2001 4
A 2002 2
A 2002 5
B 2001 7
B 2001 8
B 2002 4
B 2002 5
B 2002 9
B 2003 3
B 2003 4
B 2005 1
end
bysort country year (z) : gen values = string(z[1])
by country year : replace values = values[_n-1] + " " + string(z) if z != z[_n-1] & _n > 1
by country year : keep if _n == _N
drop z
list , sepby(country)
+-------------------------+
| country year values |
|-------------------------|
1. | A 2000 1 |
2. | A 2001 1 2 4 |
3. | A 2002 2 5 |
|-------------------------|
4. | B 2001 7 8 |
5. | B 2002 4 5 9 |
6. | B 2003 3 4 |
7. | B 2005 1 |
+-------------------------+
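The replace chain above builds a running concatenation of each new value within a sorted group. The same logic written out in plain Python as a cross-check (dictionary keys play the role of the by-groups; this is only a sketch of the rule):

```python
# (country, year, z) rows from the example
data = [("A", 2000, 1), ("A", 2001, 1), ("A", 2001, 2), ("A", 2001, 4),
        ("A", 2002, 2), ("A", 2002, 5), ("B", 2001, 7), ("B", 2001, 8),
        ("B", 2002, 4), ("B", 2002, 5), ("B", 2002, 9), ("B", 2003, 3),
        ("B", 2003, 4), ("B", 2005, 1)]

# accumulate each new value of z per (country, year), in sorted order,
# mirroring what bysort country year (z) guarantees
seen = {}
for country, year, z in sorted(data):
    vals = seen.setdefault((country, year), [])
    if z not in vals:
        vals.append(z)
values = {key: " ".join(str(z) for z in v) for key, v in seen.items()}
# e.g. values[("A", 2001)] -> "1 2 4", as in the listed output
```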
I think there may be some problems with your desired output given your input, but otherwise something like this should do it:
clear
input str1 country year z
"A" 2000 1
"A" 2001 1
"A" 2001 2
"A" 2001 4
"A" 2002 2
"A" 2002 5
"B" 2001 7
"B" 2001 8
"B" 2002 4
"B" 2002 5
"B" 2002 9
"B" 2003 3
"B" 2003 4
"B" 2005 1
end
gen z_distinct = ""
egen c_x_y = group(country year)
levelsof c_x_y, local(pairs)
foreach p of local pairs {
qui levelsof z if c_x_y == `p', clean separate(" ")
qui replace z_distinct = "`r(levels)'" if c_x_y==`p'
}
collapse (first) z_distinct, by(country year)
sort country year
The code loops over country-years, calculating the observed values of z using levelsof, and then collapses to get one row for each country-year.

Identify group with two variables

Suppose I have the following data in Stata:
clear
input id tna ret str2 name
1 2 3 "X"
1 3 2 "X"
1 5 3 "X"
1 6 -1 "X"
2 4 2 "X"
2 6 -1 "X"
2 8 -2 "X"
2 9 3 "P"
2 11 -2 "P"
3 3 1 "Y"
3 4 0 "Y"
3 6 -1 "Y"
3 8 1 "Z"
3 6 1 "Z"
end
I want to make an ID for new groups. These new groups should incorporate the observations with the same name (for example X), but should also incorporate all the observations of the same ID if the name started in that ID. For example:
X is in the data set under two IDs: 1 and 2. The group of X should incorporate all the observations with the name X, but also the two observations of the name P (since X started in ID 2 and the two observations with value P belong to group X)
Y started in ID 3, so the group should incorporate every observation with ID 3.
This is a tricky problem to solve because it may take several passes to completely stabilize the identifiers. Fortunately, you can use group_id (from SSC) to solve this. To install group_id, type in Stata's Command window:
ssc install group_id
Here's a more complicated data example where "P" also appears in ID == 4 and that ID also contains "A" as a name:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id tna ret) str2 name
1 2 3 "X"
1 3 2 "X"
1 5 3 "X"
1 6 -1 "X"
2 4 2 "X"
2 6 -1 "X"
2 8 -2 "X"
2 9 3 "P"
2 11 -2 "P"
3 3 1 "Y"
3 4 0 "Y"
3 6 -1 "Y"
3 8 1 "Z"
3 6 1 "Z"
4 9 3 "P"
4 11 -2 "P"
4 12 0 "A"
end
clonevar newid = id
group_id newid, match(name)
I am not sure that I understand the definitions here (e.g. tna and ret are not explained; conversely, omit them from a question if they are irrelevant; does "start" imply a process in time?), but why not copy the first value of name within each id, and then classify on those first names? (With your example data, the results are the same.)
clear
input id tna ret str2 name
1 2 3 "X"
1 3 2 "X"
1 5 3 "X"
1 6 -1 "X"
2 4 2 "X"
2 6 -1 "X"
2 8 -2 "X"
2 9 3 "P"
2 11 -2 "P"
3 3 1 "Y"
3 4 0 "Y"
3 6 -1 "Y"
3 8 1 "Z"
3 6 1 "Z"
end
sort id, stable
by id: gen first = name[1]
egen group = group(first), label
list, sepby(group)
+---------------------------------------+
| id tna ret name first group |
|---------------------------------------|
1. | 1 2 3 X X X |
2. | 1 3 2 X X X |
3. | 1 5 3 X X X |
4. | 1 6 -1 X X X |
5. | 2 4 2 X X X |
6. | 2 6 -1 X X X |
7. | 2 8 -2 X X X |
8. | 2 9 3 P X X |
9. | 2 11 -2 P X X |
|---------------------------------------|
10. | 3 3 1 Y Y Y |
11. | 3 4 0 Y Y Y |
12. | 3 6 -1 Y Y Y |
13. | 3 8 1 Z Y Y |
14. | 3 6 1 Z Y Y |
+---------------------------------------+
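The copy-first-value idea reduces to a single pass in any language: remember the first name seen for each id, then label every row of that id with it. A plain Python sketch of that rule (note that, as the other answer explains, real data may need several passes to stabilize, which this one-pass sketch does not handle):

```python
# (id, name) pairs in original order, as in the example data
rows = [(1, "X")] * 4 + [(2, "X")] * 3 + [(2, "P")] * 2 \
     + [(3, "Y")] * 3 + [(3, "Z")] * 2

first = {}
for id_, name in rows:
    first.setdefault(id_, name)   # keep the first name observed within each id

group = [first[id_] for id_, _ in rows]
# group -> "X" for all rows of ids 1 and 2, "Y" for all rows of id 3
```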

Transaction percentage total count in Stata

Here is a sales transactions data set in Stata format. Each row is a sale of
a specific product
in a specific week
at a specific store
in a specific city
Some products were not sold at all stores in all weeks for a given city. For each product, I would like to calculate its market availability in that city, in percentage terms, for a given week. For example, if product A was sold in week 1 in half of all the distinct stores in the city (the number of available stores changes from week to week), a new column would indicate a market availability of 50% for all those observations. As a different example, in the following sample data set for week 1, my desired variable mkt_avail would look like this (ignore the units_sold column for now):
week store SKU city units_sold mkt_avail
1 200059 01182007 C 5 1
1 200060 01182007 C 4 1
1 200061 01182007 C 4 1
1 200060 01182090 C 6 0.66
1 200059 01182090 C 4 0.66
1 200061 01182888 C 1 0.33
2 200059 01182007 K 4 1
2 200060 01182007 K 1 1
2 200061 01182007 K 4 1
2 200059 01182090 K 8 0.66
2 200060 01182090 K 9 0.66
2 200061 01182888 K 4 0.33
This is a Stata table:
clear
set more off
input str5 week str8 store str30 SKU units_sold str1 city
1 200059 01182007 5 C
1 200059 01182090 4 C
1 200060 01182007 4 C
1 200060 01182090 6 C
1 200061 01182007 4 C
1 200061 01182888 1 C
2 200059 01182007 4 K
2 200060 01182007 1 K
2 200061 01182007 4 K
2 200059 01182090 8 K
2 200060 01182090 9 K
2 200061 01182888 4 K
end
The problem is that in this transactions data set, the same week store city SKU combinations can appear several times because of repeated purchases; but we don't want to consider repeated observations in the calculation of our shares because we already know that a specific item was available at that time.
I begin with tagging the unique observations by week and city
egen tag = tag(week city)
I also try
egen tag1 = tag(store SKU)
Now, should I try and match them up together?
Logically, I think I need first to count, for each city/week/SKU, the distinct stores in which that SKU was sold; then to count the number of stores active in that city/week; and then to divide the first number by the second. Any thoughts?
Your strategy seems good. You can tag distinct (not "unique") observations just once in two ways and then calculate a fraction by dividing totals. This can all be done without any file choreography. The assumption here is that there are no observations recording zero sales. But if there are, then adding if units_sold to the tag() calculations should be sufficient to ignore them.
. clear
. set more off
. input str5 week str8 store str30 SKU units_sold str1 city
week store SKU units_s~d city
1. 1 200059 01182007 5 C
2. 1 200059 01182090 4 C
3. 1 200060 01182007 4 C
4. 1 200060 01182090 6 C
5. 1 200061 01182007 4 C
6. 1 200061 01182888 1 C
7. 2 200059 01182007 4 K
8. 2 200060 01182007 1 K
9. 2 200061 01182007 4 K
10. 2 200059 01182090 8 K
11. 2 200060 01182090 9 K
12. 2 200061 01182888 4 K
13. end
. egen tag = tag(city week store SKU)
. egen stores_selling_product = total(tag), by(city week SKU)
. egen tag2 = tag(city week store)
. egen stores_in_city = total(tag2), by(city week)
. gen fraction = stores_sell/stores_in
. sort week SKU store
. l week store SKU city stores* fraction , sepby(week)
+------------------------------------------------------------------+
| week store SKU city stores~t stores~y fraction |
|------------------------------------------------------------------|
1. | 1 200059 01182007 C 3 3 1 |
2. | 1 200060 01182007 C 3 3 1 |
3. | 1 200061 01182007 C 3 3 1 |
4. | 1 200059 01182090 C 2 3 .6666667 |
5. | 1 200060 01182090 C 2 3 .6666667 |
6. | 1 200061 01182888 C 1 3 .3333333 |
|------------------------------------------------------------------|
7. | 2 200059 01182007 K 3 3 1 |
8. | 2 200060 01182007 K 3 3 1 |
9. | 2 200061 01182007 K 3 3 1 |
10. | 2 200059 01182090 K 2 3 .6666667 |
11. | 2 200060 01182090 K 2 3 .6666667 |
12. | 2 200061 01182888 K 1 3 .3333333 |
+------------------------------------------------------------------+
On the terminology of distinct and unique in Stata context, and more importantly a review of technique in this territory, see this paper.
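For anyone wanting to verify the arithmetic independently, the tag-and-total logic reduces to two set counts per group. A plain Python cross-check (sets stand in for the tag() deduplication; only a sketch, not a Stata substitute):

```python
# Distinct (week, store, SKU, city) transactions from the example; using a
# set drops any repeated purchases of the same combination
sales = {
    (1, "200059", "01182007", "C"), (1, "200059", "01182090", "C"),
    (1, "200060", "01182007", "C"), (1, "200060", "01182090", "C"),
    (1, "200061", "01182007", "C"), (1, "200061", "01182888", "C"),
    (2, "200059", "01182007", "K"), (2, "200060", "01182007", "K"),
    (2, "200061", "01182007", "K"), (2, "200059", "01182090", "K"),
    (2, "200060", "01182090", "K"), (2, "200061", "01182888", "K"),
}

selling = {}   # stores selling this SKU, per (city, week, SKU)
in_city = {}   # all stores seen, per (city, week)
for week, store, sku, city in sales:
    selling.setdefault((city, week, sku), set()).add(store)
    in_city.setdefault((city, week), set()).add(store)

fraction = {k: len(v) / len(in_city[k[:2]]) for k, v in selling.items()}
# e.g. fraction[("C", 1, "01182090")] -> 2/3, as in the listed output
```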
I think this solution is not the best, but it should do what you want:
save original,replace // keeping your original dataset
collapse (count)has_sold=units_sold if units_sold>0, by(week store SKU city) // make binary flag for counting
replace has_sold=1 // force binary flag
save tmp,replace // preserving current status
bysort week store: keep if _n==1
egen numStoreWeekly = count(has_sold), by(week) // get total number of stores in week regardless city
drop SKU has_sold // dropping temporary variables
merge m:m week store city using tmp // adding numStoreWeekly to tmp.dta ("merge m:m" was used to assign same numStoreWeekly to same week/store/city combination)
egen numStoreSold = count(has_sold), by(week city SKU) // counting stores sold by week city SKU
gen mkt_avail = numStoreSold/numStoreWeekly
drop numStoreSold numStoreWeekly _merge has_sold // dropping temporary variables
merge m:m week store city SKU using original // merging back (adding mkt_avail to original.dta )
drop _merge
sort week city SKU store