get mean of values that fit a specific criteria (pattern matching) - regex

I asked this question before and got a reply that solved it for me. I have a dataframe that looks like this:
id weekdays halflife
241732222300860000 Friday, Aug 31, 2012, 22 0.4166666667
241689170123309000 Friday, Aug 31, 2012, 19 0.3833333333
241686878137512000 Friday, Aug 31, 2012, 19 0.4
241651117396738000 Friday, Aug 31, 2012, 16 1.5666666667
241635163505820000 Friday, Aug 31, 2012, 15 0.95
241633401382265000 Friday, Aug 31, 2012, 15 2.3666666667
And I would like to get average half life of items that were created on Monday, then on Tuesday...etc. (My date range spans over 6 months).
To get the date values I used strptime and difftime. Also, I found the maximum halflife with max(df$halflife), how can I find which id it corresponds to?
Reproducible code:
structure(list(id = c(241732222300860416, 241689170123309056,
241686878137511936, 241651117396738048, 241635163505819648, 241633401382264832
), weekdays = c("Friday, Aug 31, 2012, 22", "Friday, Aug 31, 2012, 19",
"Friday, Aug 31, 2012, 19", "Friday, Aug 31, 2012, 16", "Friday, Aug 31, 2012, 15",
"Friday, Aug 31, 2012, 15"), halflife = structure(c(0.416666666666667,
0.383333333333333, 0.4, 1.56666666666667, 0.95, 2.36666666666667
), class = "difftime", units = "mins")), .Names = c("id",
"weekdays", "halflife"), row.names = c(NA, 6L), class = "data.frame")
So now I have an average half-life value for all Mondays, Tuesdays, etc. How can I get the average value for each hour within those weekdays, i.e. the average half-life of all items that were created on all Mondays at 9am, then 10am, then 11am, etc., and then on Tuesdays at 9am, 10am, 11am, etc.? The dates in the weekdays column are formatted so that the last number after the comma is the hour the item was created at. I am really bad with regular expressions and pattern matching, which is why I am asking this follow-up question.

With base packages you can do the following.
> mydf
id weekdays halflife
1 2.417322e+17 Friday, Aug 31, 2012, 22 0.4166667 mins
2 2.416892e+17 Friday, Aug 31, 2012, 19 0.3833333 mins
3 2.416869e+17 Friday, Aug 31, 2012, 19 0.4000000 mins
4 2.416511e+17 Friday, Aug 31, 2012, 16 1.5666667 mins
5 2.416352e+17 Friday, Aug 31, 2012, 15 0.9500000 mins
6 2.416334e+17 Friday, Aug 31, 2012, 15 2.3666667 mins
Instead of using regex, we can just use strsplit on each element of weekdays, unlist the result, put it back into a 4-column matrix, and cbind that matrix onto mydf.
> mydf2 <- cbind(mydf, matrix(unlist(sapply(mydf$weekdays, strsplit, split=',')), byrow=TRUE, ncol=4, dimnames=list(1:nrow(mydf), c('Weekday', 'Day', 'Year', 'Hour'))))
> mydf2
id weekdays halflife Weekday Day Year Hour
1 2.417322e+17 Friday, Aug 31, 2012, 22 0.4166667 mins Friday Aug 31 2012 22
2 2.416892e+17 Friday, Aug 31, 2012, 19 0.3833333 mins Friday Aug 31 2012 19
3 2.416869e+17 Friday, Aug 31, 2012, 19 0.4000000 mins Friday Aug 31 2012 19
4 2.416511e+17 Friday, Aug 31, 2012, 16 1.5666667 mins Friday Aug 31 2012 16
5 2.416352e+17 Friday, Aug 31, 2012, 15 0.9500000 mins Friday Aug 31 2012 15
6 2.416334e+17 Friday, Aug 31, 2012, 15 2.3666667 mins Friday Aug 31 2012 15
Now that we have split the weekdays column appropriately, we can use the aggregate function to calculate the mean over the desired grouping columns.
> aggregate(halflife ~ Weekday, data=mydf2, FUN = mean)
Weekday halflife
1 Friday 1.013889
If you want to group by Weekday as well as Hour then
> aggregate(halflife ~ Weekday + Hour, data=mydf2, FUN = mean)
Weekday Hour halflife
1 Friday 15 1.6583333
2 Friday 16 1.5666667
3 Friday 19 0.3916667
4 Friday 22 0.4166667
The first parameter of the aggregate function here is a formula object, which supports one ~ one, one ~ many, many ~ one, and many ~ many relationships. See the examples in ?aggregate to understand how to use it.
Here is a brief example of a many ~ many relationship.
> set.seed(12345)
> mydf2 <- cbind(mydf2, newvar = rnorm(nrow(mydf2)))
> mydf2
id weekdays halflife Weekday Day Year Hour newvar
1 2.417322e+17 Friday, Aug 31, 2012, 22 0.4166667 mins Friday Aug 31 2012 22 0.5855288
2 2.416892e+17 Friday, Aug 31, 2012, 19 0.3833333 mins Friday Aug 31 2012 19 0.7094660
3 2.416869e+17 Friday, Aug 31, 2012, 19 0.4000000 mins Friday Aug 31 2012 19 -0.1093033
4 2.416511e+17 Friday, Aug 31, 2012, 16 1.5666667 mins Friday Aug 31 2012 16 -0.4534972
5 2.416352e+17 Friday, Aug 31, 2012, 15 0.9500000 mins Friday Aug 31 2012 15 0.6058875
6 2.416334e+17 Friday, Aug 31, 2012, 15 2.3666667 mins Friday Aug 31 2012 15 -1.8179560
> aggregate(cbind(newvar,halflife) ~ Weekday + Hour, data=mydf2, FUN = mean)
Weekday Hour newvar halflife
1 Friday 15 -0.6060343 1.6583333
2 Friday 16 -0.4534972 1.5666667
3 Friday 19 0.3000814 0.3916667
4 Friday 22 0.5855288 0.4166667
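For readers who want to check the split-and-group logic outside R, here is a minimal Python sketch using only the standard library. It is not part of the original answer: the names `rows` and `mean_by` are made up for illustration, and the data are the six sample rows from the question with half-lives in minutes.

```python
from collections import defaultdict
from statistics import mean

# Sample rows from the question: (weekdays string, halflife in minutes).
rows = [
    ("Friday, Aug 31, 2012, 22", 0.4166666667),
    ("Friday, Aug 31, 2012, 19", 0.3833333333),
    ("Friday, Aug 31, 2012, 19", 0.4),
    ("Friday, Aug 31, 2012, 16", 1.5666666667),
    ("Friday, Aug 31, 2012, 15", 0.95),
    ("Friday, Aug 31, 2012, 15", 2.3666666667),
]

def mean_by(rows, key_fields):
    """Average halflife per group; key_fields picks which split parts form the key."""
    groups = defaultdict(list)
    for text, halflife in rows:
        # Split "Friday, Aug 31, 2012, 22" on commas and trim the spaces.
        weekday, day, year, hour = (part.strip() for part in text.split(","))
        parts = {"Weekday": weekday, "Day": day, "Year": year, "Hour": int(hour)}
        groups[tuple(parts[f] for f in key_fields)].append(halflife)
    return {key: mean(values) for key, values in groups.items()}

by_weekday = mean_by(rows, ["Weekday"])
by_weekday_hour = mean_by(rows, ["Weekday", "Hour"])

for key in sorted(by_weekday_hour):
    print(key, round(by_weekday_hour[key], 7))
```

Grouping on a tuple key plays the same role as listing several columns on the right-hand side of the formula in aggregate.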

Power Query/DAX to calculate monthly raw sales figure

Dear stackoverflow, please help!
I'm hoping for some assistance with data processing in Power BI, either using Power Query or DAX. At this point I am really stuck and can't figure out how to solve this problem.
The below table is a list of sales by Product, Month, and Year. The problem with my data is that the value in the sales data is actually cumulative (year to date), rather than the raw figure of sales for that month. In other words, the figure is the sum of the number of sales for the month (for that Year and Product combination) and the number of sales for all preceding months of that year. As you will see in the table below, the number gets progressively larger in each category as the year progresses. The true number of sales for TVs in Feb of 2021, for example, is the YTD figure of 3 minus the corresponding figure for TVs in Jan of 2021 (1), which gives 2.
I really would appreciate if anyone knows of a solution to this problem. In reality, my table has hundreds of thousands of rows, so I cannot do the calculations manually.
Is there a way to use Power Query or DAX to create a calculated column with the Raw Sales figure for each month? Something that would check if Product and Year are equal, then subtract the Jan figure from the Feb figure and so on?
Any help will be very much appreciated,
Sales Table
Product  Sales (YTD)  Month  Year
TV       1            Jan    2021
Radio    4            Jan    2021
Cooker   5            Jan    2021
TV       3            Feb    2021
Radio    5            Feb    2021
Cooker   6            Feb    2021
TV       3            Mar    2021
Radio    6            Mar    2021
Cooker   8            Mar    2021
TV       5            Apr    2021
Radio    7            Apr    2021
Cooker   8            Apr    2021
TV       7            May    2021
Radio    8            May    2021
Cooker   8            May    2021
TV       9            Jun    2021
Radio    10           Jun    2021
Cooker   10           Jun    2021
TV       10           Jul    2021
Radio    10           Jul    2021
Cooker   10           Jul    2021
TV       11           Aug    2021
Radio    13           Aug    2021
Cooker   12           Aug    2021
TV       11           Sep    2021
Radio    13           Sep    2021
Cooker   12           Sep    2021
TV       12           Oct    2021
Radio    14           Oct    2021
Cooker   13           Oct    2021
TV       17           Nov    2021
Radio    19           Nov    2021
Cooker   17           Nov    2021
TV       19           Dec    2021
Radio    20           Dec    2021
Cooker   20           Dec    2021
TV       4            Jan    2022
Radio    2            Jan    2022
Cooker   3            Jan    2022
TV       5            Feb    2022
Radio    3            Feb    2022
Cooker   5            Feb    2022
Thanks, Jim
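Give this a try in Power Query / M. It groups on Product and Year, sorts each group by month, and subtracts the previous row's YTD value from each row to determine the period amount. The index column is added before the sort, so this assumes the rows of each group already arrive in calendar order.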
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Grouped Rows" = Table.Group(Source, {"Product", "Year"}, {
        {"data", each
            let
                r = Table.Sort(Table.AddIndexColumn(_, "Index", 0, 1), {each List.PositionOf({"Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"}, [Month]), {"Month", Order.Ascending}}),
                x = Table.AddColumn(r, "Period Sales", each if [Index] = 0 then [#"Sales (YTD)"] else [#"Sales (YTD)"] - r{[Index]-1}[#"Sales (YTD)"])
            in
                x,
            type table}
    }),
    #"Expanded data" = Table.ExpandTableColumn(#"Grouped Rows", "data", {"Sales (YTD)", "Month", "Period Sales"}, {"Sales (YTD)", "Month", "Period Sales"})
in
    #"Expanded data"
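The group, sort, and subtract logic can also be sketched outside Power Query. The Python below is only an illustration of the idea, not a translation of the M query: `period_sales` and `MONTHS` are hypothetical names, and the rows are a small subset of the question's table.

```python
from collections import defaultdict

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# A few (Product, Sales YTD, Month, Year) rows taken from the question's table.
rows = [
    ("TV", 1, "Jan", 2021), ("Radio", 4, "Jan", 2021),
    ("TV", 3, "Feb", 2021), ("Radio", 5, "Feb", 2021),
    ("TV", 3, "Mar", 2021), ("Radio", 6, "Mar", 2021),
    ("TV", 4, "Jan", 2022), ("TV", 5, "Feb", 2022),
]

def period_sales(rows):
    """Return {(product, year, month): sales for that month alone}."""
    groups = defaultdict(list)
    for product, ytd, month, year in rows:
        groups[(product, year)].append((MONTHS.index(month), month, ytd))

    result = {}
    for (product, year), entries in groups.items():
        entries.sort()          # calendar order within each product/year
        previous = 0            # the running total restarts every year
        for _, month, ytd in entries:
            result[(product, year, month)] = ytd - previous
            previous = ytd
    return result

sales = period_sales(rows)
print(sales[("TV", 2021, "Feb")])   # 3 - 1
```

Unlike the query above, this sketch sorts before it looks at the previous row, so the input rows may arrive in any order.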

How to have a measure always show the ratio as percentage in scorecard/textbox POWER BI

There are many ways to have a measure show a percentage in a table column,
but I cannot find a way to always show, as a percentage, the share of one SPECIFIC group between two categories.
data sample:
YEAR MONTH TYPE AMOUNT
2020 Jan A 100
2020 Feb A 250
2020 Mar A 230
2020 Jan B 158
2020 Feb B 23
2020 Mar B 46
2019 Jan A 499
2019 Feb A 65
2019 Mar A 289
2019 Jan B 465
2019 Feb B 49
2019 Mar B 446
2018 Jan A 13
2018 Feb A 97
2018 Mar A 26
2018 Jan B 216
2018 Feb B 264
2018 Mar B 29
2018 Jan A 314
2018 Feb A 659
2018 Mar A 226
2018 Jan B 469
2018 Feb B 564
2018 Mar B 164
My goal is to always show the amount of A as a percentage of the total amount.
YEAR and MONTH are used to synchronize with the slicer.
E.g. if I select YEAR = 2020, MONTH = Jan:
100/258 ≈ 38.8%
Manually inputted in textbox
First, create the following 3 measures in your table:
1.
amount_A =
CALCULATE(
    SUM(pie_chart_data[AMOUNT]),
    FILTER(
        ALLSELECTED(pie_chart_data),
        pie_chart_data[TYPE] = "A"
    )
)
2.
amount_overall =
CALCULATE(
    SUM(pie_chart_data[AMOUNT]),
    ALLSELECTED(pie_chart_data)
)
3.
amount_A_percentage = [amount_A]/[amount_overall]
Now add both the amount_A and amount_overall measures to the Values field of your donut chart. Then place the amount_A_percentage measure in a Card visual and position the card in the center of the donut chart.
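The arithmetic behind the three measures can be checked with a few lines of Python. This is only a sketch of the calculation, not of how Power BI evaluates filter context: `share_of_type` is a hypothetical helper, the year and month arguments stand in for the slicer selection, and the rows are the 2020 sample from the question.

```python
# (YEAR, MONTH, TYPE, AMOUNT) rows for 2020 from the question's sample.
rows = [
    (2020, "Jan", "A", 100), (2020, "Feb", "A", 250), (2020, "Mar", "A", 230),
    (2020, "Jan", "B", 158), (2020, "Feb", "B", 23), (2020, "Mar", "B", 46),
]

def share_of_type(rows, year, month, wanted="A"):
    """Fraction of the selected period's total AMOUNT that belongs to `wanted`."""
    selected = [r for r in rows if r[0] == year and r[1] == month]
    total = sum(amount for _, _, _, amount in selected)
    part = sum(amount for _, _, kind, amount in selected if kind == wanted)
    return part / total if total else None   # no rows selected: no ratio

ratio = share_of_type(rows, 2020, "Jan")
print(f"{ratio:.1%}")   # 100 / 258
```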

Create Week Variable Based on Date Time Variable

Simple question, I think.
I have a checkin_date_time variable in a database with thousands of unique records.
Database
ID checkin_date_time
1 January 01, 2019 11:36:50
2 January 01, 2019 11:36:55
....
60000 December 31, 2019 11:36:50
60001 December 31, 2019 11:36:55
I would like to create a 'week' variable based on the checkin_date_time variable. So for example 'January 01, 2019 11:36:55' would equal week 1 and 'December 31, 2019 15:16:57' would equal week 52.
Desired Output
ID datetime Week
1 January 01, 2019 11:36:50 1
2 January 01, 2019 11:36:55 1
....
60000 December 31, 2019 11:36:50 52
60001 December 31, 2019 11:36:55 52
I tried using the following code, but the log says missing values were generated:
data testl;
set ed_tat;
week=week(checkin_date_time);
run;
NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
WEEK operates on a date value, not a datetime, so use DATEPART() to get the date first and then determine the week.
week = week(datepart(checkin_date_time));

display each day's date python from today

I was able to display the week that starts every Saturday by:
import datetime
today = datetime.datetime.now().date()
sat_offset = (today.weekday() - 5) % 7
week_start = today - datetime.timedelta(days=sat_offset)
This will display the week starting from last Saturday, but how would I show the date of each following day as well? So if the week of Oct. 27, 2018 is displayed, it should say:
Saturday : Oct. 27, 2018
Sunday: Oct. 28, 2018
Monday: Oct. 29, 2018
Tuesday: Oct. 30, 2018
Wednesday: Oct. 31, 2018
Thursday: Nov. 01, 2018
Friday: Nov. 02, 2018
Thank you for your help.
You can iterate through the days of the week using range and timedelta like so:
for i in range(7):
    day = week_start + datetime.timedelta(days=i)
    print(day.strftime("%A: %b. %d, %Y"))
For a week starting on Oct. 27, 2018, this will produce dates like:
Saturday: Oct. 27, 2018
Sunday: Oct. 28, 2018
Monday: Oct. 29, 2018
Tuesday: Oct. 30, 2018
Wednesday: Oct. 31, 2018
Thursday: Nov. 01, 2018
Friday: Nov. 02, 2018
You can format the string however you want. The available format codes are listed in the strftime section of Python's datetime documentation.
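If the week needs to be computed in more than one place, the offset calculation from the question and the loop can be wrapped in one function. This is a minimal sketch: `week_dates` and its `first_weekday` parameter are hypothetical names, and the fixed date is only there to make the example reproducible.

```python
import datetime

def week_dates(today, first_weekday=5):
    """Return the 7 dates of the week containing `today`.

    first_weekday uses date.weekday() numbering (Monday=0 ... Sunday=6),
    so the default 5 starts the week on Saturday.
    """
    offset = (today.weekday() - first_weekday) % 7
    start = today - datetime.timedelta(days=offset)
    return [start + datetime.timedelta(days=i) for i in range(7)]

for day in week_dates(datetime.date(2018, 10, 31)):
    # %A and %b are locale-dependent; English names assume an English/C locale.
    print(day.strftime("%A: %b. %d, %Y"))
```

Returning date objects rather than formatted strings leaves the choice of format to the caller.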

javascript regular expression: how do I find date without year or date with year<2010

I need to find dates without a year, or dates with a year before 2010.
basically,
Feb 15
Feb 20
Feb 20, 2009
Feb 20, 1995
should be accepted
Feb 20, 2010
Feb 20, 2011
should be rejected
How do I do it?
Thanks,
Cheng
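Try this:
^(Jan|Feb|Mar...Dec)\s\d{1,2}(,\s([1][0-9][0-9][0-9]|200[0-9]))?$
Note: Expand the month list with the proper names; I was too lazy to spell it all out. The year group is optional so that dates without a year are accepted, and the ^ and $ anchors stop "Feb 20, 2010" from matching on its "Feb 20" prefix.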
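To check a pattern of this kind against the six examples from the question, here is a small sketch using Python's re module. It spells out the month list, makes the year optional, and anchors both ends. The names `MONTHS`, `DATE_BEFORE_2010` and `accepted` are made up for illustration. The pattern uses only syntax that JavaScript regular expressions also support, so the same pattern text should carry over to a RegExp literal, though that is not tested here.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Month and day, then optionally ", YYYY" where YYYY is 1000-2009.
# ^ and $ anchor the whole string so "Feb 20, 2010" cannot match as just "Feb 20".
DATE_BEFORE_2010 = re.compile(
    rf"^(?:{MONTHS})\s\d{{1,2}}(?:,\s(?:1[0-9]{{3}}|200[0-9]))?$"
)

def accepted(text):
    """True if text is a date with no year or with a year before 2010."""
    return DATE_BEFORE_2010.match(text) is not None

for sample in ["Feb 15", "Feb 20", "Feb 20, 2009", "Feb 20, 1995",
               "Feb 20, 2010", "Feb 20, 2011"]:
    print(sample, "->", "accepted" if accepted(sample) else "rejected")
```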