Pandas add multiple new columns at once from list of lists

I have a list of timestamp lists where each inner list looks like this:
['Tue', 'Feb', '7', '10:07:40', '2017']
Is it possible with Pandas to add five new columns at the same time to an already created dataframe (same length as the outer list), that are equal to each of these values, with names 'day','month','date','time','year'?

You can build a second DataFrame from the list with the DataFrame constructor and join it to the original with concat:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
L = [['Tue', 'Feb', '7', '10:07:40', '2017'],
     ['Tue', 'Feb', '7', '10:07:40', '2017'],
     ['Tue', 'Feb', '7', '10:07:40', '2017']]
cols = ['day', 'month', 'date', 'time', 'year']
df1 = pd.DataFrame(L, columns=cols)
print(df1)
day month date time year
0 Tue Feb 7 10:07:40 2017
1 Tue Feb 7 10:07:40 2017
2 Tue Feb 7 10:07:40 2017
df2 = pd.concat([df, df1], axis=1)
print (df2)
A B C day month date time year
0 1 4 7 Tue Feb 7 10:07:40 2017
1 2 5 8 Tue Feb 7 10:07:40 2017
2 3 6 9 Tue Feb 7 10:07:40 2017
One liner:
df2 = pd.concat([df, pd.DataFrame(L, columns=['day','month','date','time','year'])], axis=1)
print (df2)
A B C day month date time year
0 1 4 7 Tue Feb 7 10:07:40 2017
1 2 5 8 Tue Feb 7 10:07:40 2017
2 3 6 9 Tue Feb 7 10:07:40 2017
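One thing to watch: concat with axis=1 aligns rows on the index, so the approach above relies on both frames having the same index. A minimal sketch, assuming the existing dataframe has a non-default index (the index values 10, 20, 30 are made up for illustration), passes that index to the constructor:

```python
import pandas as pd

# Existing frame with a non-default index (illustrative values).
df = pd.DataFrame({'A': [1, 2, 3]}, index=[10, 20, 30])

L = [['Tue', 'Feb', '7', '10:07:40', '2017'],
     ['Tue', 'Feb', '7', '10:07:40', '2017'],
     ['Tue', 'Feb', '7', '10:07:40', '2017']]
cols = ['day', 'month', 'date', 'time', 'year']

# Reuse df's index so that concat lines the rows up one to one.
new_cols = pd.DataFrame(L, columns=cols, index=df.index)
df2 = pd.concat([df, new_cols], axis=1)
print(df2)
```

Without index=df.index the new frame would get the index 0, 1, 2, and concat would produce six rows padded with NaN instead of three complete ones.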

Linear descending line according to a measure

I have a calendar table that I created with M.
I relate it to a table of activities, which I grouped by week.
I calculated the total value of the activities in that time period with a DAX measure (say 5000). I need to plot a line that descends linearly over the period from that value (5000) to 0.
I've managed to get close, but the line doesn't end at 0: it either overshoots or falls one period short.
Here is the current table and the expected table:
Year  Month  End of Week  Expected
2021  6      05/06/2021
2021  6      12/06/2021
2021  6      19/06/2021
2021  6      26/06/2021
2021  7      03/07/2021
2021  7      10/07/2021
2021  7      17/07/2021
2021  7      24/07/2021
2021  7      31/07/2021
2021  8      07/08/2021
2021  8      14/08/2021
2021  8      21/08/2021
2021  8      28/08/2021
2021  9      04/09/2021
2021  9      11/09/2021
2021  9      18/09/2021
2021  9      25/09/2021
2021  10     02/10/2021
2021  10     09/10/2021
2021  10     16/10/2021
2021  10     23/10/2021
2021  10     30/10/2021
2021  11     06/11/2021
2021  11     13/11/2021
2021  11     20/11/2021
2021  11     27/11/2021
2021  12     04/12/2021
2021  12     11/12/2021
2021  12     18/12/2021
2021  12     25/12/2021
2022  1      01/01/2022
2022  1      08/01/2022
2022  1      15/01/2022
2022  1      22/01/2022
2022  1      29/01/2022
2022  2      05/02/2022
EXPECTED TABLE
Year  Month  End of Week  Expected
2021  6      05/06/2021   5000
2021  6      12/06/2021   4857,143
2021  6      19/06/2021   4714,286
2021  6      26/06/2021   4571,429
2021  7      03/07/2021   4428,571
2021  7      10/07/2021   4285,714
2021  7      17/07/2021   4142,857
2021  7      24/07/2021   4000
2021  7      31/07/2021   3857,143
2021  8      07/08/2021   3714,286
2021  8      14/08/2021   3571,429
2021  8      21/08/2021   3428,571
2021  8      28/08/2021   3285,714
2021  9      04/09/2021   3142,857
2021  9      11/09/2021   3000
2021  9      18/09/2021   2857,143
2021  9      25/09/2021   2714,286
2021  10     02/10/2021   2571,429
2021  10     09/10/2021   2428,571
2021  10     16/10/2021   2285,714
2021  10     23/10/2021   2142,857
2021  10     30/10/2021   2000
2021  11     06/11/2021   1857,143
2021  11     13/11/2021   1714,286
2021  11     20/11/2021   1571,429
2021  11     27/11/2021   1428,571
2021  12     04/12/2021   1285,714
2021  12     11/12/2021   1142,857
2021  12     18/12/2021   1000
2021  12     25/12/2021   857,1429
2022  1      01/01/2022   714,2857
2022  1      08/01/2022   571,4286
2022  1      15/01/2022   428,5714
2022  1      22/01/2022   285,7143
2022  1      29/01/2022   142,8571
2022  2      05/02/2022   0
I have been advised to hide the decimal places when the line is shown in the chart. However, I will not round the underlying values, so that the line stays straight all the way down.
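An overshoot or a missing period like the one described usually comes from dividing by the number of periods instead of the number of intervals between them. The arithmetic can be sketched in Python (this is not DAX, and the function name linear_burndown is made up for illustration): with 36 weeks there are 35 steps, so each step is 5000 / 35, which matches the 142,857 drop per row in the expected table.

```python
def linear_burndown(total, n_periods):
    """Return n_periods values falling linearly from total down to 0."""
    if n_periods < 2:
        # With a single period there is no slope to compute.
        return [float(total)] * n_periods
    steps = n_periods - 1  # intervals between the first and the last period
    # Period i (0-based) keeps the share of the total that is still remaining.
    return [total * (steps - i) / steps for i in range(n_periods)]


values = linear_burndown(5000, 36)
print(round(values[1], 3), values[-1])
```

The same idea should carry over to a measure: rank each week within the selected period starting at 0, then multiply the total by (last rank - current rank) / last rank.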

Power Query/DAX to calculate monthly raw sales figure

Dear stackoverflow, please help!
I'm hoping for some assistance with data processing in Power BI, either using Power Query or DAX. At this point I am really stuck and can't figure out how to solve this problem.
The table below lists sales by Product, Month, and Year. The problem is that the sales value is cumulative (year to date) rather than the raw figure for that month. In other words, each figure is the sum of the sales for that month and for all preceding months of the same Year and Product. As you can see in the table below, the number gets progressively larger in each category as the year progresses. The true number of TV sales in Feb 2021, for example, is the Feb figure (3) minus the corresponding Jan 2021 figure (1), which is 2.
I really would appreciate if anyone knows of a solution to this problem. In reality, my table has hundreds of thousands of rows, so I cannot do the calculations manually.
Is there a way to use Power Query or DAX to create a calculated column with the Raw Sales figure for each month? Something that would check if Product and Year are equal, then subtract the Jan figure from the Feb figure and so on?
Any help will be very much appreciated,
Sales Table
Product  Sales (YTD)  Month  Year
TV       1            Jan    2021
Radio    4            Jan    2021
Cooker   5            Jan    2021
TV       3            Feb    2021
Radio    5            Feb    2021
Cooker   6            Feb    2021
TV       3            Mar    2021
Radio    6            Mar    2021
Cooker   8            Mar    2021
TV       5            Apr    2021
Radio    7            Apr    2021
Cooker   8            Apr    2021
TV       7            May    2021
Radio    8            May    2021
Cooker   8            May    2021
TV       9            Jun    2021
Radio    10           Jun    2021
Cooker   10           Jun    2021
TV       10           Jul    2021
Radio    10           Jul    2021
Cooker   10           Jul    2021
TV       11           Aug    2021
Radio    13           Aug    2021
Cooker   12           Aug    2021
TV       11           Sep    2021
Radio    13           Sep    2021
Cooker   12           Sep    2021
TV       12           Oct    2021
Radio    14           Oct    2021
Cooker   13           Oct    2021
TV       17           Nov    2021
Radio    19           Nov    2021
Cooker   17           Nov    2021
TV       19           Dec    2021
Radio    20           Dec    2021
Cooker   20           Dec    2021
TV       4            Jan    2022
Radio    2            Jan    2022
Cooker   3            Jan    2022
TV       5            Feb    2022
Radio    3            Feb    2022
Cooker   5            Feb    2022
Thanks, Jim
Give this a try in Power Query (M). It groups on Product and Year, sorts each group by month, and subtracts the previous row's YTD value from each row to get the amount for that period.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    MonthOrder = {"Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"},
    #"Grouped Rows" = Table.Group(Source, {"Product", "Year"}, {
        {"data", each
            let
                // sort by calendar month first, then number the rows, so Index matches the sorted position
                sorted = Table.Sort(_, each List.PositionOf(MonthOrder, [Month])),
                r = Table.AddIndexColumn(sorted, "Index", 0, 1),
                x = Table.AddColumn(r, "Period Sales", each if [Index] = 0 then [#"Sales (YTD)"] else [#"Sales (YTD)"] - r{[Index] - 1}[#"Sales (YTD)"])
            in x,
         type table}
    }),
    #"Expanded data" = Table.ExpandTableColumn(#"Grouped Rows", "data", {"Sales (YTD)", "Month", "Period Sales"}, {"Sales (YTD)", "Month", "Period Sales"})
in
    #"Expanded data"
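For comparison, the same group, sort, and subtract logic can be sketched in pandas. This is only an illustration of the technique, not part of the Power Query solution; the frame below holds the first three months of the question's 2021 data.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

sales = pd.DataFrame({
    "Product": ["TV", "Radio", "Cooker"] * 3,
    "Sales (YTD)": [1, 4, 5, 3, 5, 6, 3, 6, 8],
    "Month": ["Jan"] * 3 + ["Feb"] * 3 + ["Mar"] * 3,
    "Year": [2021] * 9,
})

# An ordered categorical makes the months sort in calendar order, not alphabetically.
sales["Month"] = pd.Categorical(sales["Month"], categories=months, ordered=True)
sales = sales.sort_values(["Product", "Year", "Month"]).reset_index(drop=True)

# Previous month's YTD value within each Product/Year group (NaN for the first month).
previous = sales.groupby(["Product", "Year"])["Sales (YTD)"].shift(1)
sales["Period Sales"] = (sales["Sales (YTD)"] - previous.fillna(0)).astype(int)
print(sales)
```

Grouping on Year as well as Product is what makes the subtraction restart every January, just as in the M query.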

How to obtain the "time" values of a schedule

Assume a fixed-rate bond with the schedule shown in the sample code below.
I am able to obtain the number of days between the tenors by using the businessDaysBetween function.
Now I would like the "time" value of each period, that is, the year fraction between consecutive schedule dates. Is there a way to get it without writing a new function?
Here is the expected result:
May 14th, 2012 .5
November 14th, 2012 .5
May 14th, 2013 .5
November 14th, 2013 .5
May 14th, 2014 .5
November 14th, 2014 .5
May 14th, 2015 .5
November 16th, 2015 .505556
May 16th, 2016 .5
November 14th, 2016 .49444
Here is the code:
from QuantLib import *
import pandas as pd
effective_date = Date(14, 11, 2011)
termination_date = Date(14, 11, 2016)
tenor = Period(Semiannual)
# Newer QuantLib releases may require explicit arguments for the calendar and the
# day counter, e.g. UnitedStates(UnitedStates.Settlement) and Thirty360(Thirty360.BondBasis)
calendar = UnitedStates()
business_convention = ModifiedFollowing
termination_business_convention = Following
date_generation = DateGeneration.Forward
end_of_month = False
day_count = Thirty360()
schedule = Schedule(effective_date,
                    termination_date,
                    tenor,
                    calendar,
                    business_convention,
                    termination_business_convention,
                    date_generation,
                    end_of_month)
# number the schedule dates starting from 1
t = []
for i, d in enumerate(schedule):
    tmp = i+1, d,
    t.append(tmp)
df = pd.DataFrame(t, columns=['tenorNo', 'tenorDate'])
# business days between each date and the previous one (0 for the first date)
nbDays = []
for x in df['tenorNo']:
    if x == 1:
        tmp = 0
    else:
        tmp = calendar.businessDaysBetween(df['tenorDate'][x-2], df['tenorDate'][x-1])
    nbDays.append(tmp)
df['nbDays'] = nbDays
print(df)
tenorNo tenorDate nbDays
0 1 November 14th, 2011 0
1 2 May 14th, 2012 125
2 3 November 14th, 2012 127
3 4 May 14th, 2013 124
4 5 November 14th, 2013 127
5 6 May 14th, 2014 124
6 7 November 14th, 2014 127
7 8 May 14th, 2015 124
8 9 November 16th, 2015 127
9 10 May 16th, 2016 125
10 11 November 14th, 2016 125
That's what DayCounter instances are for. The time will depend on the day-count convention you choose (for example, you seem to be using 30/360).
Calling
day_count.yearFraction(date1, date2)
will return the time between date1 and date2.
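To see where the expected numbers come from, here is a plain-Python sketch of the arithmetic behind a 30/360 year fraction. It is a simplified illustration, not QuantLib's implementation (the function name is made up, and end-of-February rules are ignored); in practice day_count.yearFraction does this for you.

```python
from datetime import date


def year_fraction_30_360(start, end):
    """Simplified 30/360 (bond basis) year fraction between two dates."""
    d1 = min(start.day, 30)
    # The end day is capped at 30 only when the start day is 30 (or was 31).
    d2 = min(end.day, 30) if d1 == 30 else end.day
    days = 360 * (end.year - start.year) + 30 * (end.month - start.month) + (d2 - d1)
    return days / 360.0


# The last four dates of the schedule in the question.
schedule_dates = [date(2015, 5, 14), date(2015, 11, 16),
                  date(2016, 5, 16), date(2016, 11, 14)]

fractions = [year_fraction_30_360(a, b)
             for a, b in zip(schedule_dates, schedule_dates[1:])]
print([round(f, 6) for f in fractions])
```

Pairing each schedule date with the previous one, as zip does here, is the same loop shape you would use with day_count.yearFraction on the QuantLib dates.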

python: obtaining a column of dates from the columns of years-months-days

Suppose I have a very simple dataframe:
>>> a
Out[158]:
monthE yearE dayE
0 10 2014 15
1 2 2012 15
2 2 2014 15
3 12 2015 15
4 2 2012 15
I want to build a date column from these three integer columns, one date per row.
With plain numbers this is enough:
>>> datetime.date(1983,11,8)
Out[159]: datetime.date(1983, 11, 8)
But when I try to create a whole column of dates (in theory a very basic request):
a.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']))
I obtain the following error:
KeyError: ('yearE', u'occurred at index monthE')
You can first strip the trailing E from the column names and then use to_datetime. Note that this gives pandas Timestamps, not Python dates:
df.columns = df.columns.str[:-1]
df['date'] = pd.to_datetime(df)
# if df has other columns, select only the three date columns
#df['date'] = pd.to_datetime(df[['year','month','day']])
print (df)
month year day date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
datetime64[ns]
print (df.date.iloc[0])
2014-10-15 00:00:00
print (type(df.date.iloc[0]))
<class 'pandas.tslib.Timestamp'>
Thanks to MaxU for this solution, which leaves the original column names untouched:
df['date'] = pd.to_datetime(df.rename(columns = lambda x: x[:-1]))
# if df has other columns
#df['date'] = pd.to_datetime(df[['yearE','monthE','dayE']].rename(columns=lambda x: x[:-1]))
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
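The KeyError occurs because apply works column by column by default, so x is a whole column and has no 'yearE' label. If you really need Python dates, add axis=1 to apply so that each row is passed instead. The resulting column has dtype object, so pandas datetime functions cannot be used on it:
df['date'] = df.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']), axis=1)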
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
object
print (df.date.iloc[0])
2014-10-15
print (type(df.date.iloc[0]))
<class 'datetime.date'>
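If Python dates are the goal, a third option is to combine the two approaches: convert with to_datetime first, then take the .dt.date accessor. A minimal sketch using the question's data (the intermediate names parts and stamps are made up for illustration):

```python
import datetime
import pandas as pd

a = pd.DataFrame({'monthE': [10, 2, 2, 12, 2],
                  'yearE': [2014, 2012, 2014, 2015, 2012],
                  'dayE': [15, 15, 15, 15, 15]})

# to_datetime expects columns named year, month and day, so drop the trailing E.
parts = a[['yearE', 'monthE', 'dayE']].rename(columns=lambda c: c[:-1])
stamps = pd.to_datetime(parts)   # datetime64 Series of pandas Timestamps
a['date'] = stamps.dt.date       # object Series of datetime.date values
print(type(a['date'].iloc[0]))
```

Keeping stamps around lets you use the vectorized datetime functions, while the date column holds the plain Python objects.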

Splitting a column that contains multiple date formats

I have a csv file that contains a column with multiple date formats. I need to split them and get the extracted result in the same format.
Wednesday 12 August 2015
Wednesday 12 August 2015
Friday April 1 2016
Friday April 1 2016
5/12/2016
5/12/2016
This is the file, and I want every date in the mm/dd/yy format. My code is as follows:
import re
import csv
import pandas as pd
df = pd.read_csv('test3.csv')
newDate = []  # collects the cleaned date strings
for item in df['serverDatePrettyFirstAction']:
    if '/' in item:
        # already in numeric form, keep as is
        newDate.append(item)
    else:
        # drop the leading weekday name
        item = item.split(' ', 1)[1]
        newDate.append(item)
df['newDate'] = newDate
df.to_csv('D:/Python/10.36.202.64/newfile.csv', index=False)
And this is what I get:
serverDatePrettyFirstAction newDate
Wednesday 12 August 2015 12-Aug-15
Wednesday 12 August 2015 12-Aug-15
Friday April 1 2016 April 1 2016
Friday April 1 2016 April 1 2016
5/12/2016 5/12/2016
5/12/2016 5/12/2016
Also, is there a way to overwrite the values in the same column itself?
A faster approach would be to use the pandas method to_datetime():
In [2]: df
Out[2]:
Date
0 Wednesday 12 August 2015
1 Wednesday 12 August 2015
2 Friday April 1 2016
3 Friday April 1 2016
4 5/12/2016
5 5/12/2016
In [6]: df['newDate'] = pd.to_datetime(df['Date'])
Result:
In [7]: df
Out[7]:
Date newDate
0 Wednesday 12 August 2015 2015-08-12
1 Wednesday 12 August 2015 2015-08-12
2 Friday April 1 2016 2016-04-01
3 Friday April 1 2016 2016-04-01
4 5/12/2016 2016-05-12
5 5/12/2016 2016-05-12
You can use the third-party dateutil library, as long as your data is not too big (it has to guess the format for every value):
import pandas as pd
from dateutil import parser
df = pd.read_csv('test3.csv')
df['newDate'] = df['serverDatePrettyFirstAction'].apply(parser.parse)
df.to_csv('newfile.csv', index=False, date_format='%Y-%m-%d')
To overwrite the values in the same column, use:
df['serverDatePrettyFirstAction'] = df['serverDatePrettyFirstAction'].apply(parser.parse)
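Neither snippet produces the mm/dd/yy strings the question asks for. A minimal sketch, assuming every value is something pandas can parse on its own: convert each value separately, format it with strftime, and write the result back to the same column. The helper name to_mm_dd_yy is made up for illustration, and the frame below stands in for the CSV file.

```python
import pandas as pd

df = pd.DataFrame({'serverDatePrettyFirstAction': [
    'Wednesday 12 August 2015',
    'Friday April 1 2016',
    '5/12/2016',
]})


def to_mm_dd_yy(text):
    # Parse one value at a time so each row may use a different format.
    return pd.to_datetime(text).strftime('%m/%d/%y')


# Assigning back to the same column overwrites the original strings.
df['serverDatePrettyFirstAction'] = df['serverDatePrettyFirstAction'].apply(to_mm_dd_yy)
print(df)
```

Parsing row by row is slower than one vectorized to_datetime call, but it avoids assuming that the whole column shares a single format. Numeric dates such as 5/12/2016 are read month first by default.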