Stacking in pandas dataframe based on column name - python-2.7

I have the following pandas DataFrame:
ID Year Jan_salary Jan_days Feb_salary Feb_days Mar_salary Mar_days
1 2016 4500 22 4200 18 4700 24
2 2016 3800 23 3600 19 4400 23
3 2016 5500 21 5200 17 5300 23
I want to convert it to the following DataFrame:
ID Year month salary days
1 2016 01 4500 22
1 2016 02 4200 18
1 2016 03 4700 24
2 2016 01 3800 23
2 2016 02 3600 19
2 2016 03 4400 23
3 2016 01 5500 21
3 2016 02 5200 17
3 2016 03 5300 23
I tried using pandas.DataFrame.stack but couldn't get the expected outcome.
I am using Python 2.7.
Please guide me on how to reshape this DataFrame.
Thanks.

# Move the identifier columns into the index so only the month/measure columns remain.
df = df.set_index(['ID', 'Year'])
# Split 'Jan_salary' etc. into a two-level column index and name the month level.
df.columns = df.columns.str.split('_', expand=True).rename('month', level=0)
# Stack the month level into rows, then turn the index back into columns.
df = df.stack(0).reset_index()
# Map the month abbreviations to two-digit month numbers.
md = dict(Jan='01', Feb='02', Mar='03')
df.month = df.month.map(md)
df = df[['ID', 'Year', 'month', 'salary', 'days']]
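For reference, a minimal, runnable sketch of the same idea (the sample frame below is re-typed from the question; I use rename_axis to name the stacked level, which is equivalent to renaming the split column level):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Year': [2016, 2016, 2016],
    'Jan_salary': [4500, 3800, 5500], 'Jan_days': [22, 23, 21],
    'Feb_salary': [4200, 3600, 5200], 'Feb_days': [18, 19, 17],
    'Mar_salary': [4700, 4400, 5300], 'Mar_days': [24, 23, 23],
})

out = df.set_index(['ID', 'Year'])
out.columns = out.columns.str.split('_', expand=True)   # ('Jan', 'salary'), ('Jan', 'days'), ...
out = (out.stack(0)                                      # month abbreviations become an index level
          .rename_axis(['ID', 'Year', 'month'])
          .reset_index())
out['month'] = out['month'].map(dict(Jan='01', Feb='02', Mar='03'))
out = (out[['ID', 'Year', 'month', 'salary', 'days']]
          .sort_values(['ID', 'Year', 'month'], ignore_index=True))
print(out)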

I love pd.melt so that's what I used in this long-winded approach:
import calendar

# Melt the salary columns into long form: one row per (ID, Year, month).
ldf = pd.melt(df, id_vars=['ID', 'Year'],
              value_vars=['Jan_salary', 'Feb_salary', 'Mar_salary'],
              var_name='month', value_name='salary')

# Melt the days columns the same way; the rows come out in the same order.
rdf = pd.melt(df, id_vars=['ID', 'Year'],
              value_vars=['Jan_days', 'Feb_days', 'Mar_days'],
              value_name='days')
rdf.drop(['ID', 'Year', 'variable'], inplace=True, axis=1)

# Glue the two long frames together side by side.
cdf = pd.concat([ldf, rdf], axis=1)

# 'Jan_salary' -> 'Jan', then 'Jan' -> '01' via the calendar module.
cdf['month'] = cdf['month'].str.replace('_salary', '')

def mapper(month_abbr):
    # from http://stackoverflow.com/a/3418092/42346
    d = {v: str(k).zfill(2) for k, v in enumerate(calendar.month_abbr)}
    return d[month_abbr]

cdf['month'] = cdf['month'].apply(mapper)
Result:
>>> cdf
ID Year month salary days
0 1 2016 01 4500 22
1 2 2016 01 3800 23
2 3 2016 01 5500 21
3 1 2016 02 4200 18
4 2 2016 02 3600 19
5 3 2016 02 5200 17
6 1 2016 03 4700 24
7 2 2016 03 4400 23
8 3 2016 03 5300 23
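Not taken from either answer above, but for completeness, an alternative sketch: pd.wide_to_long can do the same reshape if the columns are first renamed so the month abbreviation sits at the end ('salary_Jan' instead of 'Jan_salary'):

import calendar
import pandas as pd

# df is the original wide frame from the question (see the runnable example above).
# 'Jan' -> '01', 'Feb' -> '02', ...; month_abbr[0] is an empty string, so skip it.
month_num = {m: str(i).zfill(2) for i, m in enumerate(calendar.month_abbr) if m}

# Swap 'Jan_salary' -> 'salary_Jan' so the month becomes the suffix wide_to_long looks for.
wide = df.rename(columns=lambda c: '_'.join(reversed(c.split('_'))) if '_' in c else c)

long_df = (pd.wide_to_long(wide, stubnames=['salary', 'days'],
                           i=['ID', 'Year'], j='month', sep='_', suffix=r'[A-Za-z]+')
             .reset_index())
long_df['month'] = long_df['month'].map(month_num)
long_df = long_df.sort_values(['ID', 'Year', 'month'], ignore_index=True)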

Related

How to have a measure always show the ratio as percentage in scorecard/textbox POWER BI

There are many methods for having a measure show a percentage in a table column,
but I cannot find a way to always show the ratio of a SPECIFIC group as a percentage between two categories.
data sample:
YEAR MONTH TYPE AMOUNT
2020 Jan A 100
2020 Feb A 250
2020 Mar A 230
2020 Jan B 158
2020 Feb B 23
2020 Mar B 46
2019 Jan A 499
2019 Feb A 65
2019 Mar A 289
2019 Jan B 465
2019 Feb B 49
2019 Mar B 446
2018 Jan A 13
2018 Feb A 97
2018 Mar A 26
2018 Jan B 216
2018 Feb B 264
2018 Mar B 29
2018 Jan A 314
2018 Feb A 659
2018 Mar A 226
2018 Jan B 469
2018 Feb B 564
2018 Mar B 164
My goal is to always show the percentage of A compared with the total amount.
YEAR and MONTH are used to synchronize with slicers.
E.g. if I select YEAR = 2020 and MONTH = Jan:
100/258 = 38%
(currently entered manually in a textbox)
First, create the following 3 measures in your table:
1.
amount_A =
CALCULATE(
    SUM(pie_chart_data[AMOUNT]),
    FILTER(
        ALLSELECTED(pie_chart_data),
        pie_chart_data[TYPE] = "A"
    )
)
2.
amount_overall =
CALCULATE(
    SUM(pie_chart_data[AMOUNT]),
    ALLSELECTED(pie_chart_data)
)
3.
amount_A_percentage = [amount_A]/[amount_overall]
Now, add both measures amount_A and amount_overall to your donut chart's Values field, and place the amount_A_percentage measure on a Card visual positioned in the center of the donut chart. The final result shows the percentage of A in the center of the donut.

Facebook Graph API: page_fans_online, localizing dates to where the business is located

I am trying to replicate the data that is used in the "When your fans are online" section of a business page's Insights dashboard. I am using the following parameters in the /insights/page_fans_online API call, which returns the data I am after:
parameters={'period':'day','since':'2018-10-20T07:00:00','until':'2018-10-21T07:00:00','access_token':page_token['access_token'][0]}
The data returned can be seen below, where:
end_time = end_time (based on the since & until dates in the parameters)
name = metric
apiHours = hour of day returned
localDate = localized date (applied manually)
localHours = hour after applying a -6 hour offset to localize to Auckland, New Zealand (applied manually to replicate what is seen on the Insights dashboard)
fansOnline = number of unique page fans online during that hour
Data:
end_time name apiHours localDate localHours fansOnline
2018-10-21T07:00:00+0000 page_fans_online 0 2018-10-19 18 21
1 2018-10-19 19 29
2 2018-10-19 20 20
3 2018-10-19 21 18
4 2018-10-19 22 20
5 2018-10-19 23 15
6 2018-10-19 0 4
7 2018-10-19 1 6
8 2018-10-19 2 5
9 2018-10-19 3 8
10 2018-10-19 4 17
11 2018-10-19 5 19
12 2018-10-19 6 26
13 2018-10-19 7 24
14 2018-10-19 8 20
15 2018-10-19 9 22
16 2018-10-19 10 19
17 2018-10-19 11 22
18 2018-10-19 12 18
19 2018-10-19 13 18
20 2018-10-19 14 18
21 2018-10-19 15 18
22 2018-10-19 16 21
23 2018-10-19 17 28
It took a while to work out that the data returned when pulling page_fans_online with the parameters specified above is for October 19th, for a New Zealand business page.
If we look at the last row in the data above:
end_time = 2018-10-21
apiHours = 23
localDate = 2018-10-19
localHours = 17
fansOnline = 28
It is saying that on 2018-10-21 at 11 pm there were 28 unique fans online. Once the dates and times are manually localized, this translates to 28 unique fans online on 2018-10-19 at 5 pm (I worked the offset out by checking the "When your fans are online" graphs in the page Insights).
There is a -54 hour offset between 2018-10-21 11:00 pm and 2018-10-19 5:00 pm. My question is: what is the logic behind the end_time and hour of day returned by the page_fans_online insights metric, and is there any information on how these should be localized depending on the country where the business is located?
The page/insights docs give only a brief description of what page_fans_online is and say the hours are in PST/PDT, but that does not help with localizing the date and hour of day:
https://developers.facebook.com/docs/graph-api/reference/v3.1/insights

Pandas - Finding percentage of values in each group

I have a CSV file like the one below.
Beat,Hour,Month,Primary Type,COUNTER
111,10AM,Apr,ASSAULT,12
111,10AM,Apr,BATTERY,5
111,10AM,Apr,BURGLARY,1
111,10AM,Apr,CRIMINAL DAMAGE,4
111,10AM,Aug,MOTOR VEHICLE THEFT,2
111,10AM,Aug,NARCOTICS,1
111,10AM,Aug,OTHER OFFENSE,18
111,10AM,Aug,THEFT,38
Now I want to find the percentage of each Primary Type grouped by the first three columns. For example, for Beat = 111, Hour = 10AM, Month = Apr: %ASSAULT = 12/(12+5+1+4) * 100. Can anyone give a clue on how to do this using pandas?
You can use groupby with transform('sum'):
df['New']=df.COUNTER/df.groupby(['Beat','Hour','Month']).COUNTER.transform('sum')*100
df
Out[575]:
Beat Hour Month Primary Type COUNTER New
0 111 10AM Apr ASSAULT 12 54.545455
1 111 10AM Apr BATTERY 5 22.727273
2 111 10AM Apr BURGLARY 1 4.545455
3 111 10AM Apr CRIMINAL DAMAGE 4 18.181818
4 111 10AM Aug MOTOR VEHICLE THEFT 2 3.389831
5 111 10AM Aug NARCOTICS 1 1.694915
6 111 10AM Aug OTHER OFFENSE 18 30.508475
7 111 10AM Aug THEFT 38 64.406780
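A self-contained way to check this (the CSV text is copied from the question):

import io
import pandas as pd

csv_text = """Beat,Hour,Month,Primary Type,COUNTER
111,10AM,Apr,ASSAULT,12
111,10AM,Apr,BATTERY,5
111,10AM,Apr,BURGLARY,1
111,10AM,Apr,CRIMINAL DAMAGE,4
111,10AM,Aug,MOTOR VEHICLE THEFT,2
111,10AM,Aug,NARCOTICS,1
111,10AM,Aug,OTHER OFFENSE,18
111,10AM,Aug,THEFT,38"""

df = pd.read_csv(io.StringIO(csv_text))
# Sum COUNTER within each (Beat, Hour, Month) group, aligned back to every row.
group_total = df.groupby(['Beat', 'Hour', 'Month'])['COUNTER'].transform('sum')
df['New'] = df['COUNTER'] / group_total * 100
print(df)   # ASSAULT in (111, 10AM, Apr): 12 / (12 + 5 + 1 + 4) * 100 = 54.545455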

excel, vba or regex to copy values downwards based on repeated values

I have the following records:
62
STARTHERE 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
STARTHERE 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
STARTHERE 1.3 vol. 84 no. 3 1996 01.3 A 1 1 1996 08 13 0
STARTHERE 1.4 vol. 84 no. 4 1996 01.4 A 0 1 1996 10 15 0
STARTHERE 1.5 vol. 84 no. 5 1996 01.5 A 0 1 1997 01 22 0
STARTHERE 1.6 vol. 84 no. 6 1996 01.6 A 0 1 1997 02 10 0
63
STARTHERE 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
STARTHERE 1.2 95:2 Apr 2002 1.2 A 0 1 2002 06 17 0
STARTHERE 1.3 95:3 Jun 2002 1.3 A 0 1 2002 07 18 0
STARTHERE 1.4 95:4 Aug 2002 1.4 A 0 1 2003 02 24 0
STARTHERE 1.5 95:5 Oct 2002 1.5 A 0 1 2003 02 24 0
64
65
STARTHERE 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0
STARTHERE 1.2 34:2 Jun 1996 1.2 A 0 1 1996 09 19 0
STARTHERE 1.3 34:3 Sep 1996 1.3 A 0 1 1996 12 17 0
I don't know if this is possible in Excel, VBA in Excel, or even through regex. I want to take the numerical value (e.g. 62) and replace the "STARTHERE" values in the rows below it, up until the next numerical value (63). Right now this is done manually, but I was wondering if there is a way of doing it mechanically, through an Excel formula, VBA, or regex, as these are what I'm familiar with. The result I want is shown below; it's also okay if the rows that contain only a number (like 62) with blank values to the right are stripped, but I'm fine if they are not:
62
62 1.1 vol. 84 no. 1 1996 01.1 A 0 1 1996 04 24 0
62 1.2 vol. 84 no. 2 1996 01.2 A 0 1 1996 05 23 0
62 1.3 vol. 84 no. 3 1996 01.3 A 1 1 1996 08 13 0
62 1.4 vol. 84 no. 4 1996 01.4 A 0 1 1996 10 15 0
62 1.5 vol. 84 no. 5 1996 01.5 A 0 1 1997 01 22 0
62 1.6 vol. 84 no. 6 1996 01.6 A 0 1 1997 02 10 0
63
63 1.1 95:1 Feb 2002 1.1 A 0 1 2002 06 03 0
63 1.2 95:2 Apr 2002 1.2 A 0 1 2002 06 17 0
63 1.3 95:3 Jun 2002 1.3 A 0 1 2002 07 18 0
63 1.4 95:4 Aug 2002 1.4 A 0 1 2003 02 24 0
63 1.5 95:5 Oct 2002 1.5 A 0 1 2003 02 24 0
64
65
65 1.1 34:1 Mar 1996 1.1 A 0 1 1996 07 16 0
65 1.2 34:2 Jun 1996 1.2 A 0 1 1996 09 19 0
65 1.3 34:3 Sep 1996 1.3 A 0 1 1996 12 17 0
Many thanks!
I assume this data is in an Excel spreadsheet, with both the numerical values and the value "STARTHERE" in the first column (column A), and the other data in columns B, C, etc.
Basically, the macro loops through the first column from the top row to the bottom row. If the value in the current cell is not a number, it is set equal to the one right above it; if it is a number, we skip to the next cell.
Sub help()
    ' Make sure column A displays plain numbers.
    ActiveSheet.Columns(1).NumberFormat = "0"
    ' Walk down column A; any non-numeric cell (e.g. "STARTHERE") takes the value of the cell above it.
    For i = 1 To ActiveSheet.UsedRange.Rows.Count
        If Not Information.IsNumeric(Cells(i, 1)) Then Cells(i, 1).Value = Cells(i - 1, 1).Value
    Next i
End Sub
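If the workbook can also be processed outside Excel, the same fill-down can be sketched with pandas. This is only an alternative sketch, not what the asker requested, and the file name and sheet layout (data starting in column A with no header row) are assumptions:

import pandas as pd

# Hypothetical file name; reading .xlsx requires openpyxl to be installed.
df = pd.read_excel("records.xlsx", header=None)

# Numbers stay numbers, "STARTHERE" (or any other text) becomes NaN ...
first_col = pd.to_numeric(df[0], errors="coerce")
# ... and each NaN is filled with the last number seen above it.
df[0] = first_col.ffill().astype("Int64")

df.to_excel("records_filled.xlsx", header=False, index=False)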

Import records from a flat txt file with space-delimited fields?

I am trying to work out how to configure Spoon (Pentaho Data Integration) to import this data correctly, given that the fields are delimited by spaces.
Also, the second-to-last field, "SP_SEC", does not always have a value and may be blank; will that affect the import?
Here is the data as I have it:
SP_NLE SP_LIB SP_DEP SP_PRV SP_DST SP_APP SP_APM SP_NOM SP_NAC SP_SEX SP_GRI SP_SEC SP_DOC
00000001 000090 70 03 04 BARDALES AHUANARI RENE 19111116 2 10 8
00000003 000001 25 01 01 MEZA DE RUIZ CARLOTA 19400119 2 20 1 1
00000004 000001 25 01 01 BARDALES TORRES JOYCE 19580122 2 20 9 1
00000005 244246 25 01 02 RAMIREZ RUIZ FRANCISCO 19600309 1 20 7 1
00000006 000001 25 01 01 SILVA RIVERA DE RIOS ALICIA 19570310 2 20 5 1
00000008 000001 25 01 01 PACAYA MANIHUARI MANUEL 19401215 1 10 1 1
00000009 233405 25 01 02 TORRES MUÑOZ GLADYS 19650902 2 20 0 1
00000010 000508 25 01 01 OLIVOS RODAS BRITALDO 19510924 1 20 3 1
00000011 000001 25 01 01 ESCUDERO HERNANDEZ JULIA ISABEL 19351118 2 30 1
00000012 000001 25 01 01 YAICATE TARICUARIMA RICARDO 19560118 1 20 0 1
00000013 000001 25 01 01 ESPINOZA DE PINEDO ALEGRIA 19371108 2 10 1
00000014 000001 25 01 01 GARCIA PINCHI RICARDO 19650315 1 30 6 1
00000015 236352 09 01 01 LAO ESPINOZA ALINA 19601217 2 30 4 1
00000017 219532 25 01 01 YAICATE YAHUARCANI OLGA 19530706 2 10 1 1
Please advise: what should go in the "Regular Expression" field, what should go in the "Separator" field on the Content tab, and are there other settings I need to change?
Any suggestions are welcome.
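No Spoon configuration to offer here, but one way to sanity-check the column layout outside Spoon is a quick pandas sketch. This assumes the file really is fixed-width rather than single-space delimited; the file name is made up, and if the inferred widths mis-handle the blank SP_SEC values, explicit column positions would be more reliable:

import pandas as pd

cols = ["SP_NLE", "SP_LIB", "SP_DEP", "SP_PRV", "SP_DST", "SP_APP", "SP_APM",
        "SP_NOM", "SP_NAC", "SP_SEX", "SP_GRI", "SP_SEC", "SP_DOC"]

# colspecs="infer" lets pandas guess the field boundaries from blank columns in the
# first rows; replace it with explicit (start, end) tuples if the guess is wrong.
df = pd.read_fwf("padron.txt", colspecs="infer", names=cols, skiprows=1)
print(df.head())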