Stata: append by ID and time stamp

I have two datasets. The first contains information on product assortment at the grocery store/day level: all the products that were available at a store on a given day.
The second contains data on the individuals who visited those stores on a given day.
For example, one person (panid = 1101758) bought only 2 products, Michelob and Sam Adams, in week 1677 at store 234140, whereas we know that 4 options were available to that individual in that store on that day: the 2 they bought plus 2 additional Budweisers.
I need to merge/append these two datasets at the store/day level for each individual, so that the final dataset shows that a person made those two purchases and that two more options were available to them at that store/day. That specific individual will thus have 4 observations: 2 purchased products and 2 further available options. I have many stores, days, and individuals.
clear
input store day str9 brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
clear
input hh store day str9 brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
In the Stata code above you can see that it was a different individual (hh 2) who purchased the 2 Budweisers. The same applies to them: the final data should show that they had the same set of options to choose from (Bud, Bud, Michelob, Sam Adams, Coors) but ended up choosing only the 2 Budweisers.
Here is an example of the end result I would like to get:
clear
input hh store day str9 brand choice
1 1 1 "Michelob" 1
1 1 1 "Sam Adams" 1
1 1 1 "Bud" 0
1 1 1 "Bud" 0
1 1 1 "Coors" 0
2 1 1 "Bud" 1
2 1 1 "Bud" 1
2 1 1 "Michelob" 0
2 1 1 "Sam Adams" 0
2 1 1 "Coors" 0
3 1 1 "Coors" 1
3 1 1 "Michelob" 0
3 1 1 "Sam Adams" 0
3 1 1 "Bud" 0
3 1 1 "Bud" 0

Here's one way to do it. It involves creating an indicator for repeated products within store and day (and within household on the purchases side), using joinby to create all possible combinations between hh and products by store and day, and finally a merge to get the choice variable.
// Import hh data
clear
input hh store day str9 brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
// Number repeated purchases of the same product within household (for merging)
bysort hh store day brand: gen n_brand = _n
gen choice = 1
tempfile hh hh_join
save `hh'
// Create dataset for use with joinby to create all possible combinations
// of hh and products per day/store
drop brand n_brand choice
duplicates drop
save `hh_join'
// Import store data
clear
input store day str9 brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
// Create number of duplicate products for merging
bysort store day brand: gen n_brand = _n
// Create all possible combinations of hh and products per day/store
joinby store day using `hh_join'
order hh store day brand n_brand
sort hh store day brand n_brand
// Merge with hh data to get choice variable
merge 1:1 hh store day brand n_brand using `hh'
drop _merge
// Replace choice with 0 if missing
replace choice = 0 if missing(choice)
list, noobs sepby(hh)
And the result:
. list, noobs sepby(hh)

+-------------------------------------------------+
| hh   store   day       brand   n_brand   choice |
|-------------------------------------------------|
|  1       1     1         Bud         1        0 |
|  1       1     1         Bud         2        0 |
|  1       1     1       Coors         1        0 |
|  1       1     1    Michelob         1        1 |
|  1       1     1   Sam Adams         1        1 |
|-------------------------------------------------|
|  2       1     1         Bud         1        1 |
|  2       1     1         Bud         2        1 |
|  2       1     1       Coors         1        0 |
|  2       1     1    Michelob         1        0 |
|  2       1     1   Sam Adams         1        0 |
|-------------------------------------------------|
|  3       1     1         Bud         1        0 |
|  3       1     1         Bud         2        0 |
|  3       1     1       Coors         1        1 |
|  3       1     1    Michelob         1        0 |
|  3       1     1   Sam Adams         1        0 |
+-------------------------------------------------+

Related

Subtract Value at Aggregate by Quarter

Values are for two groups by quarter.
In DAX, I need to summarize all the data but also subtract 3 from each quarter in 2021 for Group 1, without allowing the value to go below 0.
This only impacts:
Group 1 only
2021 only
However, I also need to retain the unadjusted data details, so I can't do this in Power Query. My data detail is actually monthly, but I'm only listing one date per quarter for brevity.
Data:

Group   Date         Value
1       01/01/2020   10
1       04/01/2020   8
1       07/01/2020   18
1       10/01/2020   2
1       01/01/2021   12
1       04/01/2021   3
1       07/01/2021   7
1       10/01/2021   2
2       01/01/2020   10
2       04/01/2020   8
2       07/01/2020   18
2       10/01/2020   2
2       01/01/2021   12
2       04/01/2021   3
2       07/01/2021   7
2       10/01/2021   2
Result:

Group   Qtr/Year   Value
1       Q1-2020    10
1       Q2-2020    8
1       Q3-2020    18
1       Q4-2020    2
1       2020       38
1       Q1-2021    9
1       Q2-2021    0
1       Q3-2021    4
1       Q4-2021    0
1       2021       13
2       Q1-2020    10
2       Q2-2020    8
2       Q3-2020    18
2       Q4-2020    2
2       2020       38
2       Q1-2021    12
2       Q2-2021    3
2       Q3-2021    7
2       Q4-2021    2
2       2021       24
Your issue can be solved with a matrix visual, after adding a calculated column that adjusts the value before the table is built.
First, add a new column using the following formula (the Group = 1 condition is needed because the adjustment applies to Group 1 only):
Revised value =
VAR newValue =
    IF ( YEAR ( Sheet1[Date] ) = 2021 && Sheet1[Group] = 1, Sheet1[Value] - 3, Sheet1[Value] )
RETURN
    IF ( newValue < 0, 0, newValue )
Second, create the matrix table for the desired outcome:

Counting observations with duplicate IDs

I have a dataset that I am converting from wide to long format.
I currently have 1 observation per patient, and each patient can have up to 5 aneurysms, recorded in wide format.
I am trying to re-arrange this dataset so that I have one observation per aneurysm instead. I have done so successfully, but now I need to label the aneurysms in a new variable called aneurysmIdentifier.
Here is a glimpse at the data. You can see how, when a patient has 4 aneurysms, I have successfully created 4 corresponding observations; however, these are duplicates created via the expand function.
I am stuck at the next point, which is creating a new variable aneurysmIdentifier that reads 1 if there is only one copy of a given record_id, 1 and 2 if there are two copies, and so forth, all the way up to 1-2-3-4-5. This would give me a point of reference for what I call aneurysm 1, 2, 3, 4 and 5, so I can keep re-arranging the data to fit.
I made a sketch to showcase what I mean: it counts how many duplicates there are and then counts forward up to the maximum of 5.
Can anyone push me in the right direction on how to achieve this?
Example of data:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 record_id float aneurysmNumber
"007128de18ce5cb1635b8f27c5435ff3" 1
"00abd7bdb6283dd0ac6b97271608a122" 1
"0142103f84693c6eda416dfc55f65de1" 1
"0153826d93a58d7e1837bb98a3c21ba8" 1
"01c729ac4601e36f245fd817d8977917" 2
"01c729ac4601e36f245fd817d8977917" 2
"01dd90093fbf201a1f357e22eaff6b6a" 1
"0208e14dcabc43dd2b57e2e8b117de4d" 1
"0210f575075e5def7ffa77530ce17ef0" 1
"022cc7a9397e81cf58cd9111f9d1db0d" 1
"02afd543116a22fc7430620727b20bb5" 1
"0303ef0bd5d256cca1c836e2b70415ac" 2
"0303ef0bd5d256cca1c836e2b70415ac" 2
"041b2b0cac589d6e3b65bb924803cf1a" 1
"0536317a2bbb936e85c3eb8294b076da" 1
"06161d4668f217937cac0ac033d8d199" 1
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"065e151f8bcebb27fabf8b052fd70566" 4
"07196414cd6bf89d94a33e149983d102" 1
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"0721c38f8275dab504fc53aebcc005ce" 4
"07bef516d53279a3f5e477d56d552a2b" 1
"08678829b7e0ee6a01b17974b4d19cfa" 1
"08bb6c65e63c499ea19ac24d5113dd94" 1
"08f036417500c332efd555c76c4654a0" 1
"090c54d021b4b21c7243cec01efbeb91" 1
"09166bb44e4c5cdb8f40d402f706816e" 1
"0930159addcdc35e7dc18812522d4377" 1
"096844af91d2e266767775b0bee9105e" 1
"09884af1bb9d59803de0c74d6df57c23" 1
"09e03748da35e9d799dc5d8ddf1909b5" 1
"0a4ce4a7941ff6d1f5c217bf5a9a3bf9" 1
"0a5db40dc58e97927b407c9210aab7ba" 2
"0a5db40dc58e97927b407c9210aab7ba" 2
"0a73c992955231650965ed87e3bd52f6" 1
"0a84ab77fff74c247a525dfde8ce988c" 3
"0a84ab77fff74c247a525dfde8ce988c" 3
"0a84ab77fff74c247a525dfde8ce988c" 3
"0af333ae400f75930125bb0585f0dcf5" 1
"0af73334d9d2166191f3385de48f15d2" 1
"0b341ac8f396a8cdb88b7c658f66f653" 2
"0b341ac8f396a8cdb88b7c658f66f653" 2
"0b35cf4beb830b361d7c164371f25149" 2
"0b35cf4beb830b361d7c164371f25149" 2
"0b3e110c9765e14a5c41fadcc3cfc300" .
"0b6681f0f441e69c26106ab344ac0733" 1
"0b8d8253a8415275dbc2619e039985bb" 3
"0b8d8253a8415275dbc2619e039985bb" 3
"0b8d8253a8415275dbc2619e039985bb" 3
"0b92c26375117bf42945c04d8d6573d4" 2
"0b92c26375117bf42945c04d8d6573d4" 2
"0ba961f437f43105c357403c920bdef1" 1
"0bb601fabe1fdfa794a5272408997a2f" 1
"0c75b36e91363d596dc46bd563c3f5ef" 1
"0d461328a3bae7164ce7d3a10f366812" 1
"0d4cc4eb459301a804cbef22914f44a3" 1
"0d4e29e11bb94e922112089f3fec61ef" 2
"0d4e29e11bb94e922112089f3fec61ef" 2
"0d513c74d667f55c8f4a9836c304149c" 1
"0da25de126bb3b3ee565eff8888004c2" 2
"0da25de126bb3b3ee565eff8888004c2" 2
"0db9ae1f2201577f431b7603d0819fa6" 1
"0dd8a681f6a5d4c888831a591e57a747" 1
"0e05d6958d878368b5fb831211fad6a1" 1
"0e3ff41e0e2b2cb5ec336fd0b04e5d44" 1
"0f61e560ab56b8fea1f2593d7d3b2718" 2
"0f61e560ab56b8fea1f2593d7d3b2718" 2
"0f69f1f998984d37f133185179d63c60" 1
"1037032886a93e66406a4c910d1ef747" 2
"1037032886a93e66406a4c910d1ef747" 2
"1044b81b354b420e85ae835ea07de2d6" 1
"10620fc488346291281212a404681386" 1
"1074389c469944edf026d193a55b1148" 1
"1090d5a678119b03cddab609289a4d3c" 1
"111eebb45cef2211a2a2ff0219095e6a" 1
"11ddcbc8de8ef56cbc578fc81b602ffc" 1
"11f22488513cf717c333786c789b0289" 2
"11f22488513cf717c333786c789b0289" 2
"121552b22cee2a1eb4360b4d2534cd39" 1
"1251d707c5dc9243dc45d04beb7c3493" 1
"125689659bb3821fa81698dd72462773" 1
"127ba572433921c5bb408fc62eb9b5d7" 1
"129bea3f73e84e37d77d55fadfeb49dd" 1
"12e8dc6fb87822be26d6678cee9644f5" 1
"12f05a65f771c9675c2c5e9cdbfc33d1" 2
"12f05a65f771c9675c2c5e9cdbfc33d1" 2
"13d2bc86f1a19ed2959cd7354bc92d1d" 1
"13db5ede38e2ae1da17884c9a18df202" 1
"13f946e50df8ad74d7cf9fa05b4ad05b" 1
"146c4b8be7996a9789873fe55a47ab41" 1
"147fadd87da13a0271225d944d2a5e98" 1
"14a1dcfa015343bbefaac9a3a45769e5" 2
"14a1dcfa015343bbefaac9a3a45769e5" 2
"14d1377f74a63ffa29db2d99e7f6a1ce" 1
"150017d944a87b4c61f90034380c0659" 1
"150f6ca1ea453260eabf3472d3ebcad1" 1
end
You can go
bysort record_id: gen aneurysm_id = _n
but the results will be arbitrary unless there is some other information, say a date variable, to provide a rationale for the ordering. Let's suppose that there is a date variable date that is numeric and in good order. Then
bysort record_id (date) : gen aneurysm_id = _n
would be a suitable modification. For date read also date-time if time of day is noted and notable.

Merge databases in Stata and create new vars based on identity and value of merged data

I have two databases, DB1 and DB2, that I would like to merge, but I am having difficulties. I would like help in determining what Stata calls what I am trying to do.
DB1 has about 1000 observations and looks like:
     +----------+
     | date   b |
     |----------|
  1. |    1   7 |
  2. |    2   6 |
  3. |    3   7 |
     +----------+
DB2 consists of 65 IDs each with about 1000 observations. It looks something like:
     +---------------+
     | date   id   b |
     |---------------|
  1. |    1    1   4 |
  2. |    2    1   4 |
  3. |    3    1   5 |
  4. |    1    2   9 |
  5. |    2    2   8 |
  6. |    3    2   7 |
  7. |    1    3   1 |
  8. |    2    3   2 |
  9. |    3    3   1 |
     +---------------+
I would like to merge DB2 with DB1 so that the ultimate database looks like:
     +-------------------------------------+
     | date   b   id1b   id2b   id3b   ... |
     |-------------------------------------|
  1. |    1   7      4      9      1   ... |
  2. |    2   6      4      8      2   ... |
  3. |    3   7      5      7      1   ... |
     +-------------------------------------+
I have been reading about the merge command, but that alone will not create my ultimate database.
Can you direct me to materials that will help me with this? What do you call what I am trying to do? I feel like I need to command Stata to generate new variables.
William Lisowski is right. This gets you what you ask for, short of an easy rename. Whether it is the best structure for your analyses is unclear: most work with similar data would be easier with a further reshape long.
clear
input date b
1 7
2 6
3 7
end
save DB1
clear
input date id b
1 1 4
2 1 4
3 1 5
1 2 9
2 2 8
3 2 7
1 3 1
2 3 2
3 3 1
end
reshape wide b, j(id) i(date)
merge 1:1 date using DB1
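For reference, the "easy rename" could use Stata's rename groups: reshape wide creates b1, b2, b3, and the question asked for id1b, id2b, id3b. A minimal sketch:
rename b# id#b   // b1 -> id1b, b2 -> id2b, b3 -> id3b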
Indeed, I would much more usually do something like this to get a long structure directly:
clear
input date b
1 7
2 6
3 7
end
rename b B
save DB1 , replace
clear
input date id b
1 1 4
2 1 4
3 1 5
1 2 9
2 2 8
3 2 7
1 3 1
2 3 2
3 3 1
end
merge m:1 date using DB1

Generating income of other members in family

I have a dataset like this.
FamilyID Status personID spouseID HeadID spouse_of_referenceID income
1 Head 1 2 1 2
1 Spouse of head 2 1 1 2
1 Child 3 NA 1 2
2 Head 1 3 1 3
2 Spouse of head 3 1 1 3
For every "child" I want to create a variable "parents' income", which is the sum of the income of the head and the income of the spouse of head.
I am thinking of something like
bysort family: egen parentsincome = if ??? status==4
because status is 4 if the person is a child.
But I am not sure how to proceed next. I thought about using _n but I couldn't think of a real solution.
That is a weak data example: 7 variables are declared, but only 6 exemplified, and there is no use of Stata syntax. "NA" isn't a Stata code for missing. Some engineering was needed to make sense of it. Statalist's advice on preparing data examples applies here too.
You can get the totals conditional on a person being head or spouse of head directly with egen. The trick below divides income by a 0/1 indicator: where the indicator is 1, income is unchanged; where it is 0, the division yields missing, which total() ignores.
clear
input FamilyID str14 Status personID spouseID HeadID spouse_of_referenceID income
1 "Head" 1 2 1 2 1000
1 "Spouse of head" 2 1 1 2 2000
1 "Child" 3 . 1 2 0
2 "Head" 1 3 1 3 3000
2 "Spouse of head" 3 1 1 3 4000
end
egen HSIncome = total(income / inlist(Status, "Head", "Spouse of head")), by(FamilyID )
list FamilyID Status personID income HSIncome, sepby(FamilyID)
     +----------------------------------------------------------+
     | FamilyID           Status   personID   income   HSIncome |
     |----------------------------------------------------------|
  1. |        1             Head          1     1000       3000 |
  2. |        1   Spouse of head          2     2000       3000 |
  3. |        1            Child          3        0       3000 |
     |----------------------------------------------------------|
  4. |        2             Head          1     3000       7000 |
  5. |        2   Spouse of head          3     4000       7000 |
     +----------------------------------------------------------+
See, e.g., Sections 9 and 10 of this paper for a review of technique.
If you're using value labels instead to show status, the code will naturally be different.
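For instance, a minimal sketch assuming Status is numeric with value labels, coded 1 = head and 2 = spouse of head (the question only confirms 4 = child; the codes 1 and 2 are assumptions):
egen HSIncome = total(income / inlist(Status, 1, 2)), by(FamilyID)   // 1 = head, 2 = spouse of head (assumed coding)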
The help for egen is explicit that you shouldn't try to use _n in conjunction with it. This is because egen often sorts the data temporarily, so observations may change their order in the dataset.

Making unbalanced panel balanced with missing observations

I am attempting to make the data balanced for my sample. My data currently looks like:
id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
....
And I would like it to look like:
id year y
1 2000 2
1 2001 .
1 2002 4
1 2003 5
2 2000 .
2 2001 2
2 2002 3
....
I have tried creating a .dta file of just the years and merging it with the data; however, I can't get it to work. Essentially I would like to add rows of missing data to the panel. I realize I could just drop ids with unbalanced data, but this is not an option for my methodology.
You need to skim the Data-Management Reference Manual [D] when looking for basic data management functionality. In this case fillin does what you seem to be asking.
clear
input id year y
1 2000 2
1 2002 4
1 2003 5
2 2001 2
2 2002 3
end
fillin id year
list, sepby(id)
     +-------------------------+
     | id   year   y   _fillin |
     |-------------------------|
  1. |  1   2000   2         0 |
  2. |  1   2001   .         1 |
  3. |  1   2002   4         0 |
  4. |  1   2003   5         0 |
     |-------------------------|
  5. |  2   2000   .         1 |
  6. |  2   2001   2         0 |
  7. |  2   2002   3         0 |
  8. |  2   2003   .         1 |
     +-------------------------+
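If the data are declared as a panel, tsfill offers an alternative route: with the full option it makes the panel strongly balanced over the whole observed span of year. A minimal sketch (note that, unlike fillin, it does not create a _fillin indicator):
xtset id year
tsfill, full   // fills every id with every year from the overall min to max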