How to resample dates with Pandas item by item? - python-2.7

My objective is to add rows in pandas so that missing dates are filled with the previous row's data, resampling the dates at the same time. Example:
This is what I have:
date wins losses
2015-12-19 11 5
2015-12-20 17 8
2015-12-20 10 6
2015-12-21 15 1
2015-12-25 11 5
2015-12-26 6 10
2015-12-27 10 6
2015-12-28 4 12
2015-12-29 8 11
And this is what I want:
wins losses
date
2015-12-19 11.0 5.0
2015-12-20 10.0 6.0
2015-12-21 15.0 1.0
2015-12-22 15.0 1.0
2015-12-23 15.0 1.0
2015-12-24 15.0 1.0
2015-12-25 11.0 5.0
2015-12-26 6.0 10.0
2015-12-27 10.0 6.0
2015-12-28 4.0 12.0
2015-12-29 8.0 11.0
And this is my code:
resamp = df.set_index('date').resample('D', how='last', fill_method='ffill')
It works!
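Note: in recent pandas versions the how= and fill_method= keywords have been removed from resample(); the equivalent modern spelling chains the aggregation and the fill explicitly (a minimal sketch, assuming the same df as above):
# modern pandas: aggregate, then forward-fill, instead of how=/fill_method=
resamp = df.set_index('date').resample('D').last().ffill()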
But I want to do the same thing with 22 million rows, with different dates and different IDs.
The dataframe below contains two productIds (1 and 2). I want to repeat the previous exercise while keeping the time-series data of every productId:
createdAt productId popularity
2015-12-01 1 5
2015-12-02 1 8
2015-12-04 1 6
2015-12-07 1 9
2015-12-01 2 5
2015-12-03 2 10
2015-12-04 2 6
2015-12-07 2 12
2015-12-09 2 11
This is my code:
df['date'] = pd.to_datetime(df['createdAt'])
df.set_index('date').resample('D', how='last', fill_method='ffill')
This is what I get if I use the same code! The two products are collapsed into a single daily series, which is not what I want.
createdAt productId popularity
date
2015-12-01 2015-12-01 2 5
2015-12-02 2015-12-02 2 5
2015-12-03 2015-12-03 2 10
2015-12-04 2015-12-04 2 6
2015-12-05 2015-12-05 2 6
2015-12-06 2015-12-06 2 6
2015-12-07 2015-12-07 2 12
2015-12-08 2015-12-08 2 12
2015-12-09 2015-12-09 2 11
This is what I want!
createdAt productId popularity
2015-12-01 1 5
2015-12-02 1 8
2015-12-03 1 8
2015-12-04 1 6
2015-12-05 1 6
2015-12-06 1 6
2015-12-07 1 9
2015-12-01 2 5
2015-12-02 2 5
2015-12-03 2 10
2015-12-04 2 6
2015-12-05 2 6
2015-12-06 2 6
2015-12-07 2 12
2015-12-08 2 12
2015-12-09 2 11
What should I do?
Thank you

Try this, it should work :)
print df.set_index('date').groupby('productId', group_keys=False).apply(lambda df: df.resample('D').ffill()).reset_index()

This produces what you said you wanted:
df['createdAt'] = pd.to_datetime(df['createdAt'])
print df.set_index('createdAt').groupby('productId', group_keys=False).apply(lambda df: df.resample('D').ffill()).reset_index()
createdAt productId popularity
0 2015-12-01 1 5
1 2015-12-02 1 8
2 2015-12-03 1 8
3 2015-12-04 1 6
4 2015-12-05 1 6
5 2015-12-06 1 6
6 2015-12-07 1 9
7 2015-12-01 2 5
8 2015-12-02 2 5
9 2015-12-03 2 10
10 2015-12-04 2 6
11 2015-12-05 2 6
12 2015-12-06 2 6
13 2015-12-07 2 12
14 2015-12-08 2 12
15 2015-12-09 2 11
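For completeness, here is a self-contained version of the same approach in modern pandas (Python 3 print syntax; the data is copied from the question):
import pandas as pd

df = pd.DataFrame({
    'createdAt': ['2015-12-01', '2015-12-02', '2015-12-04', '2015-12-07',
                  '2015-12-01', '2015-12-03', '2015-12-04', '2015-12-07',
                  '2015-12-09'],
    'productId': [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'popularity': [5, 8, 6, 9, 5, 10, 6, 12, 11],
})
df['createdAt'] = pd.to_datetime(df['createdAt'])

# Resample each product's series to daily frequency separately,
# forward-filling the gaps, then restore a flat index.
out = (df.set_index('createdAt')
         .groupby('productId', group_keys=False)
         .apply(lambda g: g.resample('D').ffill())
         .reset_index())
print(out)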

Related

calculating the investment rate of unbalanced panel data from a firm-level dataset

I am doing a project using a firm-level dataset (unbalanced panel data). I have around 200,000 firms over 10 years. However, the start and end of each firm's period differ: some firms start in 1990 and finish in 2000, while others start in 2005 and finish in 2015. I would like to calculate the investment rate using tangible fixed assets (TFA), which is basically (TFA(t)-TFA(t-1))/TFA(t-1), for each firm in Stata. Would you help me with this issue?
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID int dec31year double TFA
1 18992 1638309000
1 19358 1430424000
1 19723 2618977000
1 20088 2.799e+09
1 20453 3507431000
1 20819 4219361000
1 21184 4347613000
1 21549 3.9619e+09
1 21914 5100955000
1 22280 5404411000
2 19358 1.5479e+10
2 19723 1.3219e+10
2 20088 1.3387e+10
2 20453 1.4867e+10
2 20819 1.636e+10
2 21184 1.6547e+10
2 21549 1.6146e+10
2 21914 1.4011e+10
2 22280 1.3141e+10
2 22645 1.3311e+10
3 19358 3.201e+09
3 19723 2.945e+09
3 20088 2.955e+09
3 20453 2.630e+09
3 20819 2.375e+09
3 21184 2.233e+09
3 21549 2.166e+09
3 21914 2.177e+09
3 22280 2.015e+09
3 22645 2.122e+09
4 18992 1425000
4 19358 395837000
4 19723 385710000
4 20088 98745000
4 20453 20387000
4 20819 1636000
4 21184 1499000
4 21549 1365000
4 21914 1439000
4 22280 92866000
5 18992 4.5909e+10
5 19358 4.6606e+10
5 19723 4.5531e+10
5 20088 4.5645e+10
5 20453 4.627e+10
5 20819 4.6155e+10
5 21184 4.5847e+10
5 21549 4.5774e+10
5 21914 4.7443e+10
5 22280 4.7853e+10
6 19358 232641000
6 19723 231892000
6 20088 190669000
6 20453 227862000
6 20819 288878000
6 21184 302291000
6 21549 694925000
6 21914 8.190e+08
6 22280 7.730e+08
6 22645 6.480e+08
7 19358 1288758000
7 19723 1217425000
7 20088 1121128000
7 20453 1033546000
7 20819 964263000
7 21184 1020210000
7 21549 1087107000
7 21914 1272572000
7 22280 1310794000
7 22645 1227395000
8 19358 2463088000
8 19723 2630901000
8 20088 2811077000
8 20453 3041447000
8 20819 3257302000
8 21184 4388377000
8 21549 4427479000
8 21914 4741731000
8 22280 4845817000
8 22645 5005846000
9 19083 609320000
9 19448 619372000
9 19813 618904000
9 20178 853070000
9 20544 838932000
9 20909 785931000
9 21274 773765000
9 21639 760809000
9 22005 760693000
9 22370 860146000
10 18992 1617674000
10 19358 1590728000
10 19723 1554051000
10 20088 1445113000
10 20453 1351322000
10 20819 1224924000
10 21184 1081895000
10 21549 133179000
10 21914 114626000
10 22280 110914000
end
format %td dec31year
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID int dec31year double TFA
44 19389  857299000
44 19754 1230192000
44 20119 1474218000
44 20484 1517779000
44 20850 1542684000
44 21184 1522782000
44 21549 1577352000
44 21914 1642480000
44 22280 1506011000
44 22645 1564853000
end
format %td dec31year
Thanks for the data example.
. gen year = year(dec31)
. tsset ID year
Panel variable: ID (weakly balanced)
Time variable: year, 2011 to 2021
Delta: 1 unit
. gen wanted = D.TFA/L.TFA
(10 missing values generated)
. su wanted
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      wanted |         90    3.778748    29.86207  -.9197528   276.7804
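As a side note (not part of the original answer), the same per-firm growth rate is easy to prototype in pandas, where groupby(...).pct_change() plays the role of Stata's D. and L. operators. A minimal sketch with made-up firm-year values shaped like the dataex example:
import pandas as pd

# two hypothetical firms, one TFA value per year
panel = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2],
    'year': [2011, 2012, 2013, 2011, 2012, 2013],
    'TFA':  [1.6e9, 1.4e9, 2.6e9, 1.5e10, 1.3e10, 1.33e10],
})
panel = panel.sort_values(['ID', 'year'])
# (TFA(t) - TFA(t-1)) / TFA(t-1), computed within each firm;
# the first year of each firm is missing, just as in Stata
panel['wanted'] = panel.groupby('ID')['TFA'].pct_change()
print(panel)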

Subtract Value at Aggregate by Quarter

Values are recorded for two groups by quarter.
In DAX, I need to summarize all the data, but also subtract 3 from each quarter in 2021 for Group 1, without allowing the value to go below 0.
This only impacts:
Group 1 only
2021 only
However, I also need to retain the unadjusted data details, so I can't do this in Power Query. My data detail is actually monthly, but I'm only listing one date per quarter for brevity.
Data:
Group  Date        Value
1      01/01/2020  10
1      04/01/2020   8
1      07/01/2020  18
1      10/01/2020   2
1      01/01/2021  12
1      04/01/2021   3
1      07/01/2021   7
1      10/01/2021   2
2      01/01/2020  10
2      04/01/2020   8
2      07/01/2020  18
2      10/01/2020   2
2      01/01/2021  12
2      04/01/2021   3
2      07/01/2021   7
2      10/01/2021   2
Result:
Group  Qtr/Year  Value
1      Q1-2020   10
1      Q2-2020    8
1      Q3-2020   18
1      Q4-2020    2
1      2020      38
1      Q1-2021    9
1      Q2-2021    0
1      Q3-2021    4
1      Q4-2021    0
1      2021      13
2      Q1-2020   10
2      Q2-2020    8
2      Q3-2020   18
2      Q4-2020    2
2      2020      38
2      Q1-2021   12
2      Q2-2021    3
2      Q3-2021    7
2      Q4-2021    2
2      2021      24
Your issue can be solved with a matrix table, together with a new calculated column that adjusts the value before the table is built.
First, add a new column using the following formula (the IF restricts the adjustment to Group 1 in 2021, and the RETURN clamps the result at zero):
Revised value =
VAR newValue =
    IF ( YEAR ( Sheet1[Date] ) = 2021 && Sheet1[Group] = 1,
        Sheet1[Value] - 3,
        Sheet1[Value] )
RETURN
    IF ( newValue < 0, 0, newValue )
Second, create the matrix table for the desired outcome:
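For anyone prototyping the same adjust-then-aggregate logic outside Power BI, here is a minimal pandas sketch (column names taken from the tables above; the clip at zero mirrors the RETURN line of the DAX column):
import pandas as pd

df = pd.DataFrame({
    'Group': [1, 1, 2, 2],
    'Date':  pd.to_datetime(['2021-01-01', '2021-04-01',
                             '2021-01-01', '2021-04-01']),
    'Value': [12, 3, 12, 3],
})
# subtract 3 only for Group 1 rows in 2021, never going below zero
mask = (df['Date'].dt.year == 2021) & (df['Group'] == 1)
df['Revised'] = df['Value'].where(~mask, (df['Value'] - 3).clip(lower=0))
# quarterly and yearly totals are then plain group sums
print(df.groupby(['Group', df['Date'].dt.to_period('Q')])['Revised'].sum())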

cumulative average powerbi by month

I have below dataset.
Math Literature Biology date student
4 2 5 2019-08-25 A
4 5 4 2019-08-08 A
5 4 5 2019-08-23 A
5 5 5 2019-08-15 A
5 5 5 2019-07-19 A
5 5 5 2019-07-15 A
5 5 5 2019-07-03 A
5 5 5 2019-06-26 A
1 1 2 2019-06-18 A
2 3 3 2019-06-14 A
5 5 5 2019-05-01 A
2 1 3 2019-04-26 A
I need to develop a solution in Power BI so that the output shows the cumulative average per subject per month.
For example:
             April   May   June   July    August
Math       |     2   3.5      3   3.75    4
Literature |     1     3      3   3.75    3.83
Biology    |     3     4    3.6   4.125   4.33
Can you help?
You can use a matrix visualization for this.
Create a month-year variable and use it in the columns.
Use the average of Math, Literature and Biology in Values.
Under the Format pane --> Values --> Show on rows --> select this.
This should give the view you are looking for. You can rename the value headers to your requirement.
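One caveat: a plain average in the matrix gives each month its own average, whereas the example output is a running average over all records so far (for Power BI itself that would likely need a measure that filters dates up to the current month). As a quick way to verify the expected numbers, here is a pandas sketch using the question's data:
import pandas as pd

df = pd.DataFrame(
    [(4, 2, 5, '2019-08-25'), (4, 5, 4, '2019-08-08'), (5, 4, 5, '2019-08-23'),
     (5, 5, 5, '2019-08-15'), (5, 5, 5, '2019-07-19'), (5, 5, 5, '2019-07-15'),
     (5, 5, 5, '2019-07-03'), (5, 5, 5, '2019-06-26'), (1, 1, 2, '2019-06-18'),
     (2, 3, 3, '2019-06-14'), (5, 5, 5, '2019-05-01'), (2, 1, 3, '2019-04-26')],
    columns=['Math', 'Literature', 'Biology', 'date'])
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')

subjects = ['Math', 'Literature', 'Biology']
# expanding (cumulative) mean over every record so far...
cum = df[subjects].expanding().mean()
# ...then keep the value reached at the end of each month
cum['month'] = df['date'].dt.to_period('M')
print(cum.groupby('month').last().T.round(3))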

How to reshape a variable to wide in my dataset?

I am trying to reshape a variable to wide but have not found the proper way to do so.
I have a day-wise count dataset for each SSUID, and I would like to reshape day to wide so that each SSUID's counts appear in a single row.
Dataset:
ssuid day count
1226 1 3
1226 2 7
1226 3 5
1226 4 7
1226 5 7
1226 6 6
1227 1 3
1227 2 6
1227 3 7
1227 4 4
1228 1 4
1228 2 4
1228 3 6
1228 4 7
1228 5 5
1229 1 3
1229 2 6
1229 3 6
1229 4 6
1229 5 5
I tried some code but I am getting the error:
count variable not constant within SSUID variable
My code:
reshape wide day, i(ssuid) j(count)
I would like to get the following result:
ssuid day1 day2 day3 day4 day5 day6
1226 3 7 5 7 7 6
1227 3 6 7 4 . .
1228 4 4 6 7 5 .
1229 3 6 6 6 5 .
The following works for me. Note that the stub of reshape wide must be the variable being spread across columns (count), while j() names the variable whose values become the suffixes (day); your attempt had the two swapped:
clear
input ssuid day count
1226 1 3
1226 2 7
1226 3 5
1226 4 7
1226 5 7
1226 6 6
1227 1 3
1227 2 6
1227 3 7
1227 4 4
1228 1 4
1228 2 4
1228 3 6
1228 4 7
1228 5 5
1229 1 3
1229 2 6
1229 3 6
1229 4 6
1229 5 5
end
reshape wide count, i(ssuid) j(day)
rename count# day#
list
+-------------------------------------------------+
| ssuid day1 day2 day3 day4 day5 day6 |
|-------------------------------------------------|
1. | 1226 3 7 5 7 7 6 |
2. | 1227 3 6 7 4 . . |
3. | 1228 4 4 6 7 5 . |
4. | 1229 3 6 6 6 5 . |
+-------------------------------------------------+
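For comparison (not part of the original answer), the same long-to-wide reshape in pandas is a single pivot; a sketch using the question's data:
import pandas as pd

long = pd.DataFrame({
    'ssuid': [1226]*6 + [1227]*4 + [1228]*5 + [1229]*5,
    'day':   [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'count': [3, 7, 5, 7, 7, 6, 3, 6, 7, 4, 4, 4, 6, 7, 5, 3, 6, 6, 6, 5],
})
# one row per ssuid, one column per day, missing cells become NaN
wide = long.pivot(index='ssuid', columns='day', values='count')
wide.columns = ['day%d' % d for d in wide.columns]
print(wide.reset_index())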

Finding the max(latest) date out of a column of dates then grouping them by employee

Importing the data frame:
df = pd.read_csv("C:\\Users")
Printing the list of employees' usernames:
print (df['AssignedTo'])
Returns:
Out[4]:
0 vaughad
1 channln
2 stalasi
3 mitras
4 martil
5 erict
6 erict
7 channln
8 saia
9 channln
10 roedema
11 vaughad
Printing the dates (the FilledEnd column)
Returns:
Out[6]:
0 2015-11-05
1 2016-05-27
2 2016-04-26
3 2016-02-18
4 2016-02-18
5 2015-11-02
6 2016-01-14
7 2015-12-15
8 2015-12-31
9 2015-10-16
10 2016-01-07
11 2015-11-20
Now I need to collect the latest date per employee.
I have tried:
MaxDate = max(df.FilledEnd)
But this just returns one date for the whole column.
So we see multiple employees in the data set with different dates. In a new column named "LatestDate", I need the latest date that corresponds to each employee: for "vaughad" it would return "2015-11-20" on all of vaughad's records, and likewise "2016-05-27" on all of channln's records.
You need to group your data first, using DataFrame.groupby(), after which you can produce aggregate values, like the maximum date in the FilledEnd series:
df.groupby('AssignedTo')['FilledEnd'].max()
This produces a series, with AssignedTo as the index, and the latest date for each of those employees as the values:
>>> df.groupby('AssignedTo')['FilledEnd'].max()
AssignedTo
channln 2016-05-27
erict 2016-01-14
martil 2016-02-18
mitras 2016-02-18
roedema 2016-01-07
saia 2015-12-31
stalasi 2016-04-26
vaughad 2015-11-20
Name: FilledEnd, dtype: object
If you wanted to add those max dates back to the dataframe, use groupby(...).transform() with numpy.max instead, so you get a series with the same index as the original:
import numpy as np

df['MaxDate'] = df.groupby('AssignedTo')['FilledEnd'].transform(np.max)
This adds in a MaxDate column:
AssignedTo FilledEnd MaxDate
0 vaughad 2015-11-05 2015-11-20
1 channln 2016-05-27 2016-05-27
2 stalasi 2016-04-26 2016-04-26
3 mitras 2016-02-18 2016-02-18
4 martil 2016-02-18 2016-02-18
5 erict 2015-11-02 2016-01-14
6 erict 2016-01-14 2016-01-14
7 channln 2015-12-15 2016-05-27
8 saia 2015-12-31 2015-12-31
9 channln 2015-10-16 2016-05-27
10 roedema 2016-01-07 2016-01-07
11 vaughad 2015-11-20 2015-11-20
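One caveat worth adding (an observation, not in the original answer): dtype: object in the output means FilledEnd holds plain strings. ISO-formatted YYYY-MM-DD strings happen to compare chronologically, which is why max() works here, but converting the column first is safer and enables real date arithmetic:
df['FilledEnd'] = pd.to_datetime(df['FilledEnd'])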