I have a dataframe in which DOCUMENT_ID is the unique id and each document contains multiple words in the WORD column. I need to add an id for each word within that document. My data:
DOCUMENT_ID WORD COUNT
0 262056708396949504 4
1 262056708396949504 DVD 1
2 262056708396949504 Girls 1
3 262056708396949504 Gone 1
4 262056708396949504 Gras 1
5 262056708396949504 Hurricane 1
6 262056708396949504 Katrina 1
7 262056708396949504 Mardi 1
8 262056708396949504 Wild 1
10 262056708396949504 donated 1
11 262056708396949504 generated 1
13 262056708396949504 revenues 1
15 262056708396949504 themed 1
17 262056708396949504 torwhore 1
18 262056708396949504 victims 1
20 262167541718319104 18
21 262167541718319104 CCUFoodMan 1
22 262167541718319104 CCUinvolved 1
23 262167541718319104 Congrats 1
24 262167541718319104 Having 1
25 262167541718319104 K 1
29 262167541718319104 blast 1
30 262167541718319104 blasty 1
31 262167541718319104 carebrighton 1
32 262167541718319104 hurricane 1
34 262167541718319104 started 1
37 262197573421502464 21
My expected outcome:
DOCUMENT_ID WORD COUNT WORD_ID
0 262056708396949504 4 1
1 262056708396949504 DVD 1 2
2 262056708396949504 Girls 1 3
3 262056708396949504 Gone 1
4 262056708396949504 Gras 1
.........
20 262167541718319104 18 1
21 262167541718319104 CCUFoodMan 1 2
22 262167541718319104 CCUinvolved 1 3
I have included rows with empty WORD cells as well, but those can be ignored.
Answer
df['WORD_ID'] = df.groupby(['DOCUMENT_ID']).cumcount()+1
Explanation
Let's build a DataFrame.
import pandas as pd

df = pd.DataFrame({
    'DOCUMENT_ID': [262056708396949504, 262056708396949504, 262056708396949504, 262056708396949504,
                    262167541718319104, 262167541718319104, 262167541718319104],
    'WORD': ['DVD', 'Girls', 'Gras', 'Gone', 'DVD', 'Girls', 'Gone']
})
df
DOCUMENT_ID WORD
0 262056708396949504 DVD
1 262056708396949504 Girls
2 262056708396949504 Gras
3 262056708396949504 Gone
4 262167541718319104 DVD
5 262167541718319104 Girls
6 262167541718319104 Gone
Given that your words are nested within each unique DOCUMENT_ID, we need a groupby operation.
df['WORD_ID'] = df.groupby(['DOCUMENT_ID']).cumcount()+1
Output:
DOCUMENT_ID WORD WORD_ID
0 262056708396949504 DVD 1
1 262056708396949504 Girls 2
2 262056708396949504 Gras 3
3 262056708396949504 Gone 4
4 262167541718319104 DVD 1
5 262167541718319104 Girls 2
6 262167541718319104 Gone 3
Related
I am doing a project using a firm-level dataset (unbalanced panel data). I have around 200,000 firms observed for 10 years. However, the start and end of each firm's period differ: some firms start in 1990 and finish in 2000, while others start in 2005 and finish in 2015. I would like to calculate the investment rate using tangible fixed assets (TFA), which is basically (TFA(t) - TFA(t-1)) / TFA(t-1), for each firm in Stata. Could you help me with this issue?
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID int dec31year double TFA
1 18992 1638309000
1 19358 1430424000
1 19723 2618977000
1 20088 2.799e+09
1 20453 3507431000
1 20819 4219361000
1 21184 4347613000
1 21549 3.9619e+09
1 21914 5100955000
1 22280 5404411000
2 19358 1.5479e+10
2 19723 1.3219e+10
2 20088 1.3387e+10
2 20453 1.4867e+10
2 20819 1.636e+10
2 21184 1.6547e+10
2 21549 1.6146e+10
2 21914 1.4011e+10
2 22280 1.3141e+10
2 22645 1.3311e+10
3 19358 3.201e+09
3 19723 2.945e+09
3 20088 2.955e+09
3 20453 2.630e+09
3 20819 2.375e+09
3 21184 2.233e+09
3 21549 2.166e+09
3 21914 2.177e+09
3 22280 2.015e+09
3 22645 2.122e+09
4 18992 1425000
4 19358 395837000
4 19723 385710000
4 20088 98745000
4 20453 20387000
4 20819 1636000
4 21184 1499000
4 21549 1365000
4 21914 1439000
4 22280 92866000
5 18992 4.5909e+10
5 19358 4.6606e+10
5 19723 4.5531e+10
5 20088 4.5645e+10
5 20453 4.627e+10
5 20819 4.6155e+10
5 21184 4.5847e+10
5 21549 4.5774e+10
5 21914 4.7443e+10
5 22280 4.7853e+10
6 19358 232641000
6 19723 231892000
6 20088 190669000
6 20453 227862000
6 20819 288878000
6 21184 302291000
6 21549 694925000
6 21914 8.190e+08
6 22280 7.730e+08
6 22645 6.480e+08
7 19358 1288758000
7 19723 1217425000
7 20088 1121128000
7 20453 1033546000
7 20819 964263000
7 21184 1020210000
7 21549 1087107000
7 21914 1272572000
7 22280 1310794000
7 22645 1227395000
8 19358 2463088000
8 19723 2630901000
8 20088 2811077000
8 20453 3041447000
8 20819 3257302000
8 21184 4388377000
8 21549 4427479000
8 21914 4741731000
8 22280 4845817000
8 22645 5005846000
9 19083 609320000
9 19448 619372000
9 19813 618904000
9 20178 853070000
9 20544 838932000
9 20909 785931000
9 21274 773765000
9 21639 760809000
9 22005 760693000
9 22370 860146000
10 18992 1617674000
10 19358 1590728000
10 19723 1554051000
10 20088 1445113000
10 20453 1351322000
10 20819 1224924000
10 21184 1081895000
10 21549 133179000
10 21914 114626000
10 22280 110914000
end
format %td dec31year
. * Example generated by -dataex-. To install: ssc install dataex
. clear
. input long ID int dec31year double TFA
ID dec31y~r TFA
1. 44 19389 857299000
2. 44 19754 1230192000
3. 44 20119 1474218000
4. 44 20484 1517779000
5. 44 20850 1542684000
6. 44 21184 1522782000
7. 44 21549 1577352000
8. 44 21914 1642480000
9. 44 22280 1506011000
10. 44 22645 1564853000
11. end
. format %td dec31year
Thanks for the data example.
. gen year = year(dec31)
. tsset ID year
Panel variable: ID (weakly balanced)
Time variable: year, 2011 to 2021
Delta: 1 unit
. gen wanted = D.TFA/L.TFA
(10 missing values generated)
. su wanted
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      wanted |         90    3.778748    29.86207  -.9197528   276.7804
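For readers less used to Stata's time-series operators: D.TFA/L.TFA is exactly the (TFA(t) - TFA(t-1)) / TFA(t-1) formula from the question, and the 10 missing values are each firm's first observed year, where no lag exists. A minimal sketch spelling the same calculation out with the lag operator only (assuming the tsset declaration above; wanted2 is just an illustrative name):

* same investment rate written explicitly: (current TFA - lagged TFA) / lagged TFA
gen wanted2 = (TFA - L.TFA) / L.TFA

Because the data are tsset on year, L.TFA respects gaps: if a firm skips a year, the rate for the following year is missing rather than being computed from the wrong base year.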
Values are for two groups by quarter. In DAX, I need to summarize all the data but also subtract 3 from each quarter in 2021 for Group 1, without allowing the value to go below 0.
This only impacts:
Group 1 Only
2021 Only
However, I also need to retain the data details without the adjustment. So I can't do this in Power Query. My data detail is actually in months but I'm only listing one date per quarter for brevity.
Data:
Group  Date        Value
1      01/01/2020     10
1      04/01/2020      8
1      07/01/2020     18
1      10/01/2020      2
1      01/01/2021     12
1      04/01/2021      3
1      07/01/2021      7
1      10/01/2021      2
2      01/01/2020     10
2      04/01/2020      8
2      07/01/2020     18
2      10/01/2020      2
2      01/01/2021     12
2      04/01/2021      3
2      07/01/2021      7
2      10/01/2021      2
Result:
Group  Qtr/Year  Value
1      Q1-2020      10
1      Q2-2020       8
1      Q3-2020      18
1      Q4-2020       2
1      2020         38
1      Q1-2021       9
1      Q2-2021       0
1      Q3-2021       4
1      Q4-2021       0
1      2021         13
2      Q1-2020      10
2      Q2-2020       8
2      Q3-2020      18
2      Q4-2020       2
2      2020         38
2      Q1-2021      12
2      Q2-2021       3
2      Q3-2021       7
2      Q4-2021       2
2      2021         24
Your issue can be solved with a matrix table, plus a new calculated column that adjusts the value before the table is built.
First, add a new column using the following formula (note the Group = 1 test, since only Group 1 should be adjusted in 2021):
Revised value =
var newValue = IF(YEAR(Sheet1[Date]) = 2021 && Sheet1[Group] = 1, Sheet1[Value] - 3, Sheet1[Value])
return
IF(newValue < 0, 0, newValue)
Second, create the matrix table from the revised value column to produce the desired outcome.
I'm working with a SAS table where I have ordered data that I need to sum in intervals of 5. I don't have a unique ID I can use for the group by statement and I'm struggling to find a solution.
Say I have this table
Number Name X Y
1 Susan 2 1
2 Susan 3 3
3 Susan 3 3
4 Susan 4 1
5 Susan 1 2
6 Susan 1 1
7 Susan 1 1
8 Susan 2 4
9 Susan 1 5
10 Susan 4 2
1 Steve 2 4
2 Steve 2 3
3 Steve 1 2
4 Steve 3 5
5 Steve 1 1
6 Steve 1 3
7 Steve 2 3
8 Steve 2 4
9 Steve 1 1
10 Steve 1 1
I'd want the output to look like
Number Name X Y
1-5 Susan 13 10
6-10 Susan 9 13
1-5 Steve 9 15
6-10 Steve 7 12
Is there an easy way to get output like this using proc sql? Thanks!
Try this:
proc sql;
select ceil(Number/5) as Grouping, Name, sum(X), sum(Y)
from have
group by Name, Grouping;
quit;
I am trying to reshape a variable to wide but cannot find the proper way to do so.
I have a day-wise count dataset for each SSUID and I would like to reshape day to wide so that the counts for each SSUID appear on a single row.
Dataset:
ssuid day count
1226 1 3
1226 2 7
1226 3 5
1226 4 7
1226 5 7
1226 6 6
1227 1 3
1227 2 6
1227 3 7
1227 4 4
1228 1 4
1228 2 4
1228 3 6
1228 4 7
1228 5 5
1229 1 3
1229 2 6
1229 3 6
1229 4 6
1229 5 5
I tried some code but I am getting the error:
count variable not constant within SSUID variable
My code:
reshape wide day, i(ssuid) j(count)
I would like to get the following result:
ssuid day1 day2 day3 day4 day5 day6
1226 3 7 5 7 7 6
1227 3 6 7 4 . .
1228 4 4 6 7 5 .
1229 3 6 6 6 5 .
The following works for me:
clear
input ssuid day count
1226 1 3
1226 2 7
1226 3 5
1226 4 7
1226 5 7
1226 6 6
1227 1 3
1227 2 6
1227 3 7
1227 4 4
1228 1 4
1228 2 4
1228 3 6
1228 4 7
1228 5 5
1229 1 3
1229 2 6
1229 3 6
1229 4 6
1229 5 5
end
reshape wide count, i(ssuid) j(day)
rename count# day#
list
+-------------------------------------------------+
| ssuid day1 day2 day3 day4 day5 day6 |
|-------------------------------------------------|
1. | 1226 3 7 5 7 7 6 |
2. | 1227 3 6 7 4 . . |
3. | 1228 4 4 6 7 5 . |
4. | 1229 3 6 6 6 5 . |
+-------------------------------------------------+
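If "in aggregate" also means a total count across all days for each ssuid, here is a small optional sketch on the reshaped data above (the variable name total is just illustrative):

* row-wise sum of the day counts for each ssuid; missing days count as zero
egen total = rowtotal(day*)
list ssuid day* total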
Dataset description:
I have a highly unbalanced panel dataset, with some unique panelist IDs appearing only once, while others appear as many as 4,900 times. Each observation reflects an alcohol purchase associated with a unique product identifier (UPC). If a panelist purchased two separate brands (hence, two different UPCs) on the same day in the same store, two distinct observations are created. However, because these purchases were made on the same day and in the same store, I can safely assume that it was just one trip. Similarly, another panelist who also has 2 observations associated with the same store but different days of purchase (or vice versa) is assumed to have made 2 store visits.
Task:
I would like to explore qualities of those people who purchased alcohol a certain number of times in the whole period. Thus, I need to identify panelists who made only 1) 1 visit, 2) 2 visits, 3) between 5 and 10 visits, 4) between 50 and 100 visits, etc.
I started by trying to identify panelists who made only 1 visit by tagging them by panelist id, day, and store. However, the program also tags the first occurrence of those who appear twice or more.
egen tag = tag(panid day store)
I also tried collapse but realized that it might not be the best solution because I want to keep my observations "as is" without aggregating any variables.
I would appreciate any insight on how to identify such observations.
UPDATE:
panid units dollars iri_key upc day tag
1100560 1 5.989 234140 00-01-18200-00834 47 1
1101253 1 13.99 652159 00-03-71990-09516 251 1
1100685 1 20.99 652159 00-01-18200-53030 18 1
1100685 1 15.99 652159 00-01-83783-37512 18 0
1101162 1 19.99 652159 00-01-34100-15341 206 1
1101162 1 19.99 652159 00-01-34100-15341 235 1
1101758 1 12.99 652159 00-01-18200-43381 30 1
1101758 1 6.989 652159 00-01-18200-16992 114 1
1101758 1 11.99 652159 00-02-72311-23012 121 1
1101758 2 21.98 652159 00-02-72311-23012 128 1
1101758 1 19.99 652159 00-01-18200-96550 223 1
1101758 1 12.99 234140 00-04-87692-29103 247 1
1101758 1 20.99 234140 00-01-18200-96550 296 1
1101758 1 12.99 234140 00-01-87692-11103 296 0
1101758 1 12.99 652159 00-01-87692-11103 317 1
1101758 1 19.99 652159 00-01-18200-96550 324 1
1101758 1 12.99 652159 00-02-87692-68103 352 1
1101758 1 12.99 652159 00-01-87692-32012 354 1
Hi Roberto, thanks for the feedback. This is a small sample of the dataset.
In the first part of this particular example, we can safely assume that all three ids 1100560, 1101253, and 1100685 visited a store only once, i.e. made only one transaction each. The first two panelists obviously have only one record each, and the third panelist purchased 2 different UPCs in the same store, same day, i.e. in the same transaction.
The second part of the example has two panelists - 1101162 and 1101758 - who made more than one transaction: two and eleven, respectively. (Panelist 1101758 has 12 observations, but only 11 distinct trips.)
I would like to identify an exact number of distinct trips (or transactions) panelists of my dataset made:
panid units dollars iri_key upc day tag total#oftrips
1100560 1 5.989 234140 00-01-18200-00834 47 1 1
1101253 1 13.99 652159 00-03-71990-09516 251 1 1
1100685 1 20.99 652159 00-01-18200-53030 18 1 1
1100685 1 15.99 652159 00-01-83783-37512 18 0 1
1101162 1 19.99 652159 00-01-34100-15341 206 1 2
1101162 1 19.99 652159 00-01-34100-15341 235 1 2
1101758 1 12.99 652159 00-01-18200-43381 30 1 11
1101758 1 6.989 652159 00-01-18200-16992 114 1 11
1101758 1 11.99 652159 00-02-72311-23012 121 1 11
1101758 2 21.98 652159 00-02-72311-23012 128 1 11
1101758 1 19.99 652159 00-01-18200-96550 223 1 11
1101758 1 12.99 234140 00-04-87692-29103 247 1 11
1101758 1 20.99 234140 00-01-18200-96550 296 1 11
1101758 1 12.99 234140 00-01-87692-11103 296 0 11
1101758 1 12.99 652159 00-01-87692-11103 317 1 11
1101758 1 19.99 652159 00-01-18200-96550 324 1 11
1101758 1 12.99 652159 00-02-87692-68103 352 1 11
1101758 1 12.99 652159 00-01-87692-32012 354 1 11
Bottom line, I guess: as long as panelist, iri_key, and day are the same, this counts as 1 trip. The total number of trips per panelist is then the number of distinct panelist, iri_key, and day combinations.
I'm not sure I understand exactly what you want, but here's my guess:
clear all
set more off
*----- example data -----
input ///
id code day store
1 1 86 1
1 1 45 1
1 3 45 1
1 3 4 4
2 1 86 1
2 1 45 1
2 3 45 1
end
format day %td
list, sepby(id)
*----- what you want? -----
egen tag = tag(id day store)
bysort id: egen totvis = total(tag)
bysort id store: egen totvis2 = total(tag)
list, sepby(id)
which will result in:
+--------------------------------------------------------+
| id code day store tag totvis totvis2 |
|--------------------------------------------------------|
1. | 1 3 05jan1960 4 1 3 1 |
2. | 1 1 15feb1960 1 1 3 2 |
3. | 1 3 15feb1960 1 0 3 2 |
4. | 1 1 27mar1960 1 1 3 2 |
|--------------------------------------------------------|
5. | 2 1 15feb1960 1 1 2 2 |
6. | 2 3 15feb1960 1 0 2 2 |
7. | 2 1 27mar1960 1 1 2 2 |
+--------------------------------------------------------+
This means person 1 made a total of 3 visits (considering all stores), and of those, 1 was to store 4 and 2 to store 1. Person 2 made 2 visits, both to store 1.
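Mapping that answer onto the variables in the update, here is a sketch under the assumption that a distinct panid, iri_key, day combination defines one trip (trips is just an illustrative name):

* one tag per distinct (panid, iri_key, day) combination, i.e. per trip
egen tag = tag(panid iri_key day)

* total number of distinct trips per panelist, repeated on every observation
bysort panid: egen trips = total(tag)

Panelists can then be grouped without collapsing the data: trips == 1 gives the one-visit group, trips == 2 the two-visit group, inrange(trips, 5, 10) the 5-to-10 group, and so on.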