Dataset description:
I have a highly unbalanced panel dataset: some unique panelist IDs appear only once, while others appear as many as 4,900 times. Each observation reflects an alcohol purchase associated with a unique product identifier (UPC). If a panelist purchased two separate brands (hence, two different UPCs) on the same day in the same store, two distinct observations are created. However, since those purchases were made on the same day in the same store, I can safely assume they represent a single trip. Similarly, another panelist who also has 2 observations associated with the same store BUT different purchase days (or vice versa) is assumed to have made 2 store visits.
Task:
I would like to explore the characteristics of panelists who purchased alcohol a certain number of times over the whole period. Thus, I need to identify panelists who made 1) exactly 1 visit, 2) 2 visits, 3) between 5 and 10 visits, 4) between 50 and 100 visits, etc.
I started by trying to identify panelists who made only 1 visit by tagging them by panelist id, day, and store. However, this also tags the first occurrence of panelists who appear two or more times.
egen tag = tag(panid day store)
I also tried collapse but realized that it might not be the best solution because I want to keep my observations "as is" without aggregating any variables.
I would appreciate any insight on how to identify such observations.
UPDATE:
panid units dollars iri_key upc day tag
1100560 1 5.989 234140 00-01-18200-00834 47 1
1101253 1 13.99 652159 00-03-71990-09516 251 1
1100685 1 20.99 652159 00-01-18200-53030 18 1
1100685 1 15.99 652159 00-01-83783-37512 18 0
1101162 1 19.99 652159 00-01-34100-15341 206 1
1101162 1 19.99 652159 00-01-34100-15341 235 1
1101758 1 12.99 652159 00-01-18200-43381 30 1
1101758 1 6.989 652159 00-01-18200-16992 114 1
1101758 1 11.99 652159 00-02-72311-23012 121 1
1101758 2 21.98 652159 00-02-72311-23012 128 1
1101758 1 19.99 652159 00-01-18200-96550 223 1
1101758 1 12.99 234140 00-04-87692-29103 247 1
1101758 1 20.99 234140 00-01-18200-96550 296 1
1101758 1 12.99 234140 00-01-87692-11103 296 0
1101758 1 12.99 652159 00-01-87692-11103 317 1
1101758 1 19.99 652159 00-01-18200-96550 324 1
1101758 1 12.99 652159 00-02-87692-68103 352 1
1101758 1 12.99 652159 00-01-87692-32012 354 1
Hi Roberto, thanks for the feedback. This is a small sample of the dataset.
In the first part of this particular example, we can safely assume that all three ids 1100560, 1101253, and 1100685 visited a store only once, i.e. made only one transaction each. The first two panelists obviously have only one record each, and the third panelist purchased 2 different UPCs in the same store, same day, i.e. in the same transaction.
The second part of the example has two panelists - 1101162 and 1101758 - who made more than one transaction: two and eleven, respectively. (Panelist 1101758 has 12 observations, but only 11 distinct trips.)
I would like to identify the exact number of distinct trips (or transactions) each panelist in my dataset made:
panid units dollars iri_key upc day tag total#oftrips
1100560 1 5.989 234140 00-01-18200-00834 47 1 1
1101253 1 13.99 652159 00-03-71990-09516 251 1 1
1100685 1 20.99 652159 00-01-18200-53030 18 1 1
1100685 1 15.99 652159 00-01-83783-37512 18 0 1
1101162 1 19.99 652159 00-01-34100-15341 206 1 2
1101162 1 19.99 652159 00-01-34100-15341 235 1 2
1101758 1 12.99 652159 00-01-18200-43381 30 1 11
1101758 1 6.989 652159 00-01-18200-16992 114 1 11
1101758 1 11.99 652159 00-02-72311-23012 121 1 11
1101758 2 21.98 652159 00-02-72311-23012 128 1 11
1101758 1 19.99 652159 00-01-18200-96550 223 1 11
1101758 1 12.99 234140 00-04-87692-29103 247 1 11
1101758 1 20.99 234140 00-01-18200-96550 296 1 11
1101758 1 12.99 234140 00-01-87692-11103 296 0 11
1101758 1 12.99 652159 00-01-87692-11103 317 1 11
1101758 1 19.99 652159 00-01-18200-96550 324 1 11
1101758 1 12.99 652159 00-02-87692-68103 352 1 11
1101758 1 12.99 652159 00-01-87692-32012 354 1 11
Bottom line, I guess, is: as long as panelist, iri_key, and day are the same, the observations count as 1 trip. The total number of trips per panelist is then the number of distinct panelist, iri_key, and day combinations.
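In code, that rule might look like the following (a minimal sketch, assuming the variable names panid, iri_key, and day from the sample above):
* tag exactly one observation per distinct panelist/store/day combination
egen trip_tag = tag(panid iri_key day)
* count the distinct trips per panelist, repeated on every observation
bysort panid: egen n_trips = total(trip_tag)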
I'm not sure I understand exactly what you want, but here's my guess:
clear all
set more off
*----- example data -----
input ///
id code day store
1 1 86 1
1 1 45 1
1 3 45 1
1 3 4 4
2 1 86 1
2 1 45 1
2 3 45 1
end
format day %td
list, sepby(id)
*----- what you want? -----
egen tag = tag(id day store)
bysort id: egen totvis = total(tag)
bysort id store: egen totvis2 = total(tag)
list, sepby(id)
which will result in:
+--------------------------------------------------------+
| id code day store tag totvis totvis2 |
|--------------------------------------------------------|
1. | 1 3 05jan1960 4 1 3 1 |
2. | 1 1 15feb1960 1 1 3 2 |
3. | 1 3 15feb1960 1 0 3 2 |
4. | 1 1 27mar1960 1 1 3 2 |
|--------------------------------------------------------|
5. | 2 1 15feb1960 1 1 2 2 |
6. | 2 3 15feb1960 1 0 2 2 |
7. | 2 1 27mar1960 1 1 2 2 |
+--------------------------------------------------------+
This means person 1 made a total of 3 visits (considering all stores), and of those, 1 was to store 4 and 2 to store 1. Person 2 made 2 visits, both to store 1.
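From there, the groups asked about in the task (1 visit, 2 visits, 5 to 10 visits, 50 to 100 visits, and so on) can be picked out by filtering on totvis; a small sketch under the same variable names as the example above:
* flag panelists falling in each visit band
gen byte one_visit   = (totvis == 1)
gen byte two_visits  = (totvis == 2)
gen byte five_to_ten = inrange(totvis, 5, 10)
* or keep only panelists with, say, 50 to 100 visits
* keep if inrange(totvis, 50, 100)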
Related
I wish to collapse my dataset and (A) obtain medians by group, and (B) obtain the 95% confidence intervals for those medians.
I can achieve (A) by using collapse (p50) median = cost, by(group).
I can obtain the confidence intervals for the groups using bysort group: centile cost, c(50), but I would ideally like to do this in a manner similar to collapse, creating a collapsed dataset of medians, lower limits (ll), and upper limits (ul) for each group, so I can export the dataset for graphing in Excel.
Data example:
input id group cost
1 0 20
2 0 40
3 0 50
4 0 40
5 0 30
6 1 20
7 1 10
8 1 10
9 1 60
10 1 30
end
Desired dataset (or something similar):
. list
+-----------------------+
| group p50 ll ul |
|-----------------------|
1. | 0 40 20 50 |
2. | 1 20 10 60 |
+-----------------------+
clear
input id group cost
1 0 20
2 0 40
3 0 50
4 0 40
5 0 30
6 1 20
7 1 10
8 1 10
9 1 60
10 1 30
end
statsby median=r(c_1) ub=r(ub_1) lb=r(lb_1), by(group) clear: centile cost
list
+--------------------------+
| group median ub lb |
|--------------------------|
1. | 0 40 50 20 |
2. | 1 20 60 10 |
+--------------------------+
In addition to the usual help and manual entry, this paper includes a riff on essentially this problem of accumulating estimates and confidence intervals.
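To get the desired variable names and ship the result to Excel, a small follow-up sketch (the output file name is just a placeholder):
* rename to match the desired layout and reorder the columns
rename (median ub lb) (p50 ul ll)
order group p50 ll ul
* export for graphing in Excel
export excel using "medians_by_group.xlsx", firstrow(variables) replace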
Values are for two groups by quarter.
In DAX, I need to summarize all the data but also subtract 3 from each quarter in 2021 for Group 1, without allowing the value to go below 0.
This only impacts:
Group 1 Only
2021 Only
However, I also need to retain the underlying detail data without the adjustment, so I can't do this in Power Query. My detail data is actually monthly, but I'm only listing one date per quarter for brevity.
Data:
Group  Date        Value
1      01/01/2020  10
1      04/01/2020  8
1      07/01/2020  18
1      10/01/2020  2
1      01/01/2021  12
1      04/01/2021  3
1      07/01/2021  7
1      10/01/2021  2
2      01/01/2020  10
2      04/01/2020  8
2      07/01/2020  18
2      10/01/2020  2
2      01/01/2021  12
2      04/01/2021  3
2      07/01/2021  7
2      10/01/2021  2
Result:
Group  Qtr/Year  Value
1      Q1-2020   10
1      Q2-2020   8
1      Q3-2020   18
1      Q4-2020   2
1      2020      38
1      Q1-2021   9
1      Q2-2021   0
1      Q3-2021   4
1      Q4-2021   0
1      2021      13
2      Q1-2020   10
2      Q2-2020   8
2      Q3-2020   18
2      Q4-2020   2
2      2020      38
2      Q1-2021   12
2      Q2-2021   3
2      Q3-2021   7
2      Q4-2021   2
2      2021      24
Your issue can be solved with a Matrix visual, after adding a calculated column that adjusts the value before the table is built.
First, add a new column using the following formula:
Revised value =
VAR newValue =
    IF ( Sheet1[Group] = 1 && YEAR ( Sheet1[Date] ) = 2021, Sheet1[Value] - 3, Sheet1[Value] )
RETURN
    IF ( newValue < 0, 0, newValue )
Second, create the Matrix visual from the new column to produce the desired outcome.
I have two datasets. One dataset contains information on product assortment at the grocery store/day level; it reflects all the products that were available in a store on a given day. The other dataset contains data on the individuals who visited those stores on a given day.
As you can see in screenshot 2, the same person (highlighted, panid=1101758) bought only 2 products, Michelob and Sam Adams, in week 1677 at store 234140, whereas we know that overall 4 options were available to that individual in that store on that same day, i.e. 2 additional Budweisers (screenshot 1, highlighted obs.).
I need to merge/append these two datasets at the store/day level for each individual so that the final dataset shows that the person made those two purchases and that, in addition, two more options were available to them at that store/day. Thus, that specific individual will have 4 observations: 2 purchased products and 2 more available options. I have many stores, days, and individuals.
input store day str9 brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
input hh store day str9 brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
In the Stata code above you can see that it was another individual (hh 2) who purchased the 2 Budweisers. For that individual a similar action has to take place: it should be shown that the individual had 5 options to choose from (Michelob, Sam Adams, Coors, Budweiser, Budweiser) but ended up choosing only the 2 Budweisers.
Here is an example of the end result I would like to obtain:
input hh store day str9 brand choice
1 1 1 "Michelob" 1
1 1 1 "Sam Adams" 1
1 1 1 "Bud" 0
1 1 1 "Bud" 0
1 1 1 "Coors" 0
2 1 1 "Bud" 1
2 1 1 "Bud" 1
2 1 1 "Michelob" 0
2 1 1 "Sam Adams" 0
2 1 1 "Coors" 0
3 1 1 "Coors" 1
3 1 1 "Michelob" 0
3 1 1 "Sam Adams" 0
3 1 1 "Bud" 0
3 1 1 "Bud" 0
Here's one way to do it. It involves creating an indicator for repeated products within store and day, using joinby to create all possible combinations between hh and products by store and day, and finally a merge to get the choice variable.
// Import hh data
clear
input hh store day str9 brand
1 1 1 "Michelob"
1 1 1 "Sam Adams"
2 1 1 "Bud"
2 1 1 "Bud"
3 1 1 "Coors"
end
// Create number of duplicate products for merging
bysort store day brand: gen n_brand = _n
gen choice = 1
tempfile hh hh_join
save `hh'
// Create dataset for use with joinby to create all possible combinations
// of hh and products per day/store
drop brand n_brand choice
duplicates drop
save `hh_join'
// Import store data
clear
input store day str9 brand
1 1 "Bud"
1 1 "Bud"
1 1 "Michelob"
1 1 "Sam Adams"
1 1 "Coors"
end
// Create number of duplicate products for merging
bysort store day brand: gen n_brand = _n
// Create all possible combinations of hh and products per day/store
joinby store day using `hh_join'
order hh store day brand n_brand
sort hh store day brand n_brand
// Merge with hh data to get choice variable
merge 1:1 hh store day brand n_brand using `hh'
drop _merge
// Replace choice with 0 if missing
replace choice = 0 if missing(choice)
list, noobs sepby(hh)
And the result:
. list, noobs sepby(hh)
+-------------------------------------------------+
| hh store day brand n_brand choice |
|-------------------------------------------------|
| 1 1 1 Bud 1 0 |
| 1 1 1 Bud 2 0 |
| 1 1 1 Coors 1 0 |
| 1 1 1 Michelob 1 1 |
| 1 1 1 Sam Adams 1 1 |
|-------------------------------------------------|
| 2 1 1 Bud 1 1 |
| 2 1 1 Bud 2 1 |
| 2 1 1 Coors 1 0 |
| 2 1 1 Michelob 1 0 |
| 2 1 1 Sam Adams 1 0 |
|-------------------------------------------------|
| 3 1 1 Bud 1 0 |
| 3 1 1 Bud 2 0 |
| 3 1 1 Coors 1 1 |
| 3 1 1 Michelob 1 0 |
| 3 1 1 Sam Adams 1 0 |
+-------------------------------------------------+
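As a quick sanity check on the merged data (a sketch reusing the variable names from the code above), you can confirm that each household gets one row per product in the store/day assortment and count how many of those rows were actually chosen:
* number of options each hh faced on that store/day
bysort hh store day: gen n_options = _N
* number of products each hh actually purchased on that store/day
bysort hh store day: egen n_chosen = total(choice)
list hh store day n_options n_chosen, noobs sepby(hh)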
I would like to replace multiple values in a column at the same time in a dataframe: I want to change 2 to 1 and 1 to 2.
data=data.frame(store=c(122,323,254,435,654,342,234,344),
                cluster=c(2,2,2,1,1,3,3,3))
The problem with my code is that after it changes 2 to 1, it then turns those 1's back into 2.
Can I do it in dplyr or something? Thank you.
Desired data set below
store cluster
122 1
323 1
254 1
435 2
654 2
342 3
234 3
344 3
I have a dataframe in which DOCUMENT_ID is the unique id that can contain multiple words from the WORD column. I need to add an id for each word within each document.
The data I need to add it to:
DOCUMENT_ID WORD COUNT
0 262056708396949504 4
1 262056708396949504 DVD 1
2 262056708396949504 Girls 1
3 262056708396949504 Gone 1
4 262056708396949504 Gras 1
5 262056708396949504 Hurricane 1
6 262056708396949504 Katrina 1
7 262056708396949504 Mardi 1
8 262056708396949504 Wild 1
10 262056708396949504 donated 1
11 262056708396949504 generated 1
13 262056708396949504 revenues 1
15 262056708396949504 themed 1
17 262056708396949504 torwhore 1
18 262056708396949504 victims 1
20 262167541718319104 18
21 262167541718319104 CCUFoodMan 1
22 262167541718319104 CCUinvolved 1
23 262167541718319104 Congrats 1
24 262167541718319104 Having 1
25 262167541718319104 K 1
29 262167541718319104 blast 1
30 262167541718319104 blasty 1
31 262167541718319104 carebrighton 1
32 262167541718319104 hurricane 1
34 262167541718319104 started 1
37 262197573421502464 21
My expected outcome:
DOCUMENT_ID WORD COUNT WORD_ID
0 262056708396949504 4 1
1 262056708396949504 DVD 1 2
2 262056708396949504 Girls 1 3
3 262056708396949504 Gone 1
4 262056708396949504 Gras 1
.........
20 262167541718319104 18 1
21 262167541718319104 CCUFoodMan 1 2
22 262167541718319104 CCUinvolved 1 3
I have included the rows with empty WORD cells as well, but they can be ignored.
Answer
df['WORD_ID'] = df.groupby(['DOCUMENT_ID']).cumcount()+1
Explanation
Let's build a DataFrame.
import pandas as pd
df = pd.DataFrame({'DOCUMENT_ID' : [262056708396949504, 262056708396949504, 262056708396949504, 262056708396949504, 262167541718319104, 262167541718319104, 262167541718319104], 'WORD' : ['DVD', 'Girls', 'Gras', 'Gone', 'DVD', 'Girls', "Gone"]})
df
DOCUMENT_ID WORD
0 262056708396949504 DVD
1 262056708396949504 Girls
2 262056708396949504 Gras
3 262056708396949504 Gone
4 262167541718319104 DVD
5 262167541718319104 Girls
6 262167541718319104 Gone
Given that your words are nested within each unique DOCUMENT_ID, we need a groupby operation.
df['WORD_ID'] = df.groupby(['DOCUMENT_ID']).cumcount()+1
Output:
DOCUMENT_ID WORD WORD_ID
0 262056708396949504 DVD 1
1 262056708396949504 Girls 2
2 262056708396949504 Gras 3
3 262056708396949504 Gone 4
4 262167541718319104 DVD 1
5 262167541718319104 Girls 2
6 262167541718319104 Gone 3