PySpark: add a new column derived from other columns, with a time condition

I have a DataFrame with the columns year, B, C, startyear and endyear, with these values:
year  B  C  startyear  endyear
2010  2  A  2012       2014
2011  2  A  2010       2013
2013  2  B  ..         ..
2012  2  C
I want to create a new column called result:
df = df.withColumn(...)
result should use the start and end years to compute the mean of B over every year between the start and end dates. For example, if
start date = 2012
end date = 2014
then result is the mean of B over 2012, 2013 and 2014: (B2012 + B2013 + B2014) / 3 = (2 + 2 + 2) / 3 = 2
Any advice?
Thank you

You can use filter with two conditions and then use an aggregate function to calculate the mean.
df = df.filter((df.year >= df.startyear) & (df.year <= df.endyear))
df.agg({"B": "mean"})
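
Note that the filter above keeps only rows whose own year falls inside their own start/end range, and the aggregate then returns a single mean for the whole DataFrame. If the goal is a per-row result (the mean of B over all years from startyear to endyear), the logic can be sketched in plain Python as below. This is only an illustration of the computation, not PySpark code: the function name add_result is hypothetical, the sketch assumes one row per year, and the start/end years for 2012 to 2014 are made up because the question elides them.

```python
from statistics import mean

# Hypothetical sample rows; start/end years for 2012-2014 are invented for illustration
rows = [
    {"year": 2010, "B": 2, "startyear": 2012, "endyear": 2014},
    {"year": 2011, "B": 2, "startyear": 2010, "endyear": 2013},
    {"year": 2012, "B": 2, "startyear": 2010, "endyear": 2012},
    {"year": 2013, "B": 2, "startyear": 2011, "endyear": 2013},
    {"year": 2014, "B": 2, "startyear": 2012, "endyear": 2014},
]

def add_result(rows):
    """Return copies of rows with 'result' = mean of B over years in [startyear, endyear]."""
    # Lookup of B by year (assumes each year appears once)
    b_by_year = {r["year"]: r["B"] for r in rows}
    out = []
    for r in rows:
        values = [b for y, b in b_by_year.items()
                  if r["startyear"] <= y <= r["endyear"]]
        # No matching years: leave the result empty instead of failing
        out.append({**r, "result": mean(values) if values else None})
    return out

with_result = add_result(rows)
```

In PySpark the same idea would most likely be expressed as a join of the DataFrame with itself on the year range followed by a grouped mean, but that translation is an assumption and is not part of the answer above.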

Related

How to create a measure for distinct category value in power bi?

category
A
B
C
A
C
B
C
A
B
A
I want only the unique values, like below:
Category
A
B
C

You need a table function to do that:
Version - 1 = VALUES(Your_Table[category])
OR
Version - 2 = DISTINCT(Your_Table[category])
OR
Version - 3 = ALL(Your_Table[category])
OR
Version - 4 = SUMMARIZE(Your_Table,Your_Table[category] )

PowerBI Create List of Month Dates

Hi, in Power BI I am trying to create a list of dates that starts from a column in my table, [COD], and ends on a set date. Right now this just loops through 60 months from the start date in [COD]. Can I specify an end date for it to loop until?
List.Transform({0..60}, (x) =>
    Date.AddMonths(Date.StartOfMonth([COD]), x))

Assuming
start = Date.StartOfMonth([COD]),
end = #date(2020, 4, 30),
one way is to add a custom column with the formula
= { Number.From(start) .. Number.From(end) }
then expand it and convert the result to the date type.
Alternatively, you could generate the list with List.Dates instead, and expand that:
= List.Dates(start, Number.From(end) - Number.From(start) + 1, #duration(1, 0, 0, 0))
Both formulas produce one entry per day between start and end.

Assuming you want start-of-month dates through June 2023: in the example below, 2023 and 6 are hard-coded, but they could easily come from a parameter, Date.Year(DateParameter), or a column, Date.Month([EndDate]).
Get the count of months with this:
12 * (2023 - Date.Year([COD]))
    + (6 - Date.Month([COD]))
    + 1
Then use this column (named Month count here) in your formula:
List.Transform({0..[Month count]-1}, (x) =>
    Date.AddMonths(Date.StartOfMonth([COD]), x)
)
You could also combine it all into one harder-to-read formula:
List.Transform(
    {0..
        (12 * (Date.Year(DateParameter) - Date.Year([COD]))
            + (Date.Month(DateParameter) - Date.Month([COD]))
        )
    }, (x) => Date.AddMonths(Date.StartOfMonth([COD]), x)
)
If there is a chance that COD could be after the end date, you would want to include error checking in the Month count formula.
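
To make the month-count arithmetic easier to check, here is a minimal Python sketch of the same calculation, including a guard for the case where COD falls after the end date. It is not M code; the function name month_starts and the sample dates are hypothetical.

```python
from datetime import date

def month_starts(cod, end):
    """First-of-month dates from cod's month through end's month, inclusive."""
    # Same formula as the Month count column: 12 * year difference + month difference + 1
    count = 12 * (end.year - cod.year) + (end.month - cod.month) + 1
    if count <= 0:
        # cod is after the end month: return an empty list instead of failing
        return []
    result = []
    for x in range(count):
        # Zero-based month index counted from January of cod's year
        total = cod.month - 1 + x
        result.append(date(cod.year + total // 12, total % 12 + 1, 1))
    return result

# Hypothetical example: COD in November 2022, end date in June 2023
dates = month_starts(date(2022, 11, 15), date(2023, 6, 30))
```

With these sample dates the count is 12 * 1 + (6 - 11) + 1 = 8, so the list runs from 2022-11-01 through 2023-06-01.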

Generate the list:
let
    Start = Date1
    , End = Date2
    , Mos = ElapsedMonths(End, Start) + 1
    , Dates = List.Transform(List.Numbers(0, Mos), each Date.AddMonths(Start, _))
in
    Dates
Definition of the ElapsedMonths(D1, D2) function:
(D1 as date, D2 as date) =>
let
    DStart = if D1 < D2 then D1 else D2
    , DEnd = if D1 < D2 then D2 else D1
    , Elapsed = (12 * (Date.Year(DEnd) - Date.Year(DStart)) + (Date.Month(DEnd) - Date.Month(DStart)))
in
    Elapsed
Of course, you can create a function rather than hard-coding the start and end dates:
(StartDate as date, optional EndDate as date, optional Months as number) =>
let
    Mos = if EndDate = null
        then (if Months = null
            then error Error.Record("Missing Parameter", "Specify either [EndDate] or [Months]", "Both are null")
            else Months
        )
        else ElapsedMonths(StartDate, EndDate) + 1
    , Dates = List.Transform(List.Numbers(0, Mos), each Date.AddMonths(StartDate, _))
in
    Dates

Calculate aggregate value of last 12 months in a Measure Power BI

I'm using this measure to calculate the aggregate sum of a value over the last 12 months. The measure works well from month 12 onward. The problem is that when fewer than 12 months of data are available, the value is not what I need.
For example, in the first month of the sample, I would like to multiply that month's value by 12 (the first month standing in for the other 11). In the second month, I'd like to average the two months and multiply the result by 12, and so on.
Could you please help me?
SumRevenue =
VAR vSumNet12 =
    CALCULATE(
        Table[Trevenue],
        DATESINPERIOD(
            CalendarM[Data],
            MAX(CalendarM[Data]),
            -12,
            MONTH
        )
    )
RETURN
    vSumNet12
Example table:
Date      Customer  Net          Trevenue     SumRevenue   ROA    ROA I Want
09/30/20  A         237767115,6  327444,2478  327444,2478  0,14%  1,65%
10/31/20  A         245689276,3  251934,78    579379,0278  0,24%  1,41%
11/30/20  A         252916933,6  262294,89    841673,9178  0,33%  1,33%
12/31/20  A         241424127    509883,07    1351556,988  0,56%  1,68%
01/31/21  A         244721140,9  259250       1610806,988  0,66%  1,58%
02/28/21  A         250913741,4  246740,33    1857547,318  0,74%  1,48%
03/31/21  A         282215365,7  550897,35    2408444,668  0,85%  1,46%
04/30/21  A         312759343,1  544161,63    2952606,298  0,94%  1,42%
05/31/21  A         325535894    419360,97    3371967,268  1,04%  1,38%
06/30/21  A         371306315    390650,41    3762617,678  1,01%  1,22%
07/31/21  A         379780645,3  527254,43    4289872,108  1,13%  1,23%
08/31/21  A         415390274,9  409196,3     4699068,408  1,13%  1,13%
09/30/21  A         433837730,6  598924,02    4970548,18   1,15%  1,15%
10/31/21  A         482659906,7  254086,32    4972699,72   1,03%  1,03%
11/30/21  A         501568104,7  318924,53    5029329,36   1,00%  1,00%
12/31/21  A         507124350,5  754897,79    5274344,08   1,04%  1,04%
01/31/22  A         510220304,2  179153,11    5194247,19   1,02%  1,02%
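
The rule described in the question (average the months available so far, then multiply by 12) can be checked outside DAX with a few lines of Python. This is only a sketch of the arithmetic, not a Power BI measure; the function name annualized_trailing is hypothetical, and the sample values are the first three rows of the example table, written with decimal points.

```python
def annualized_trailing(monthly, window=12):
    """For each month, average the values in the trailing window and scale to `window` months."""
    out = []
    for i in range(len(monthly)):
        # Up to `window` most recent values, fewer at the start of the series
        recent = monthly[max(0, i - window + 1): i + 1]
        out.append(sum(recent) / len(recent) * window)
    return out

# First three rows of the example table (Trevenue and Net)
trevenue = [327444.2478, 251934.78, 262294.89]
net = [237767115.6, 245689276.3, 252916933.6]

annualized = annualized_trailing(trevenue)
roa = [a / n for a, n in zip(annualized, net)]
```

For these three rows the ratios come out at roughly 1.65%, 1.41% and 1.33%, matching the "ROA I Want" column. Once a full 12 months are available, the average times 12 equals the plain 12-month sum, so the existing measure is unchanged from month 12 onward. In a measure, the same idea would amount to dividing the 12-month sum by the number of months that actually have data in the window and multiplying by 12.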

Power Bi show data with multiple conditions

I am trying to help a public school here, but I have very limited knowledge of Power BI, so I hope you can enlighten me on this case.
We have a very simple report with a table and a KPI:
The KPI counts all students
The table shows students' grades
Student  Math  Portuguese  History  Science
StD A    6     6           7        8
StD B    6     7           6        7
StD C    8     9           7        8
StD D    6     6           6        6
StD E    6     7           8        8
StD F    8     6           7        7
The rule that must be applied to both the KPI (count(Students)) and the table is to show a student only if:
at least 2 subjects are equal to or under 6
Portuguese is equal to or under 6
Math is under 6
All the rest should not be shown in the table or counted in the KPI. In this case I would see/count only students A, B, D, E & F.
Any help would be much appreciated.

To tackle your task, try the following:
Create a calculated column in your table with the following DAX code:
isValid =
VAR cond_2_subjects = (('Table'[Math] <= 6) + ('Table'[Portuguese] <= 6) + ('Table'[History] <= 6) + ('Table'[Science] <= 6)) >= 2
VAR cond_portuguese = 'Table'[Portuguese] <= 6
VAR cond_math = 'Table'[Math] < 6
RETURN
    -- This checks whether any of the given conditions is true
    IF(
        cond_2_subjects || cond_portuguese || cond_math,
        TRUE(),
        FALSE()
    )
The KPI (measure) can then be written like so:
# Students =
CALCULATE(
    COUNT('Table'[Student]),
    -- only count students where the conditions are true (calculated column isValid = TRUE)
    'Table'[isValid] = TRUE()
)
To show only those students in the table visual, set 'Table'[isValid] = TRUE() as a filter on that visual.
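
As a quick cross-check of the three conditions against the sample grades, the same logic can be written in plain Python. This is a sketch for verification only, not part of the Power BI solution; the names grades and is_valid are hypothetical.

```python
# Sample grades from the question
grades = {
    "A": {"Math": 6, "Portuguese": 6, "History": 7, "Science": 8},
    "B": {"Math": 6, "Portuguese": 7, "History": 6, "Science": 7},
    "C": {"Math": 8, "Portuguese": 9, "History": 7, "Science": 8},
    "D": {"Math": 6, "Portuguese": 6, "History": 6, "Science": 6},
    "E": {"Math": 6, "Portuguese": 7, "History": 8, "Science": 8},
    "F": {"Math": 8, "Portuguese": 6, "History": 7, "Science": 7},
}

def is_valid(g):
    """True if any of the three rules holds, mirroring the isValid column."""
    # Booleans count as 0/1, so the sum is the number of subjects at or under 6
    low_count = sum(score <= 6 for score in g.values())
    return low_count >= 2 or g["Portuguese"] <= 6 or g["Math"] < 6

valid = [student for student, g in grades.items() if is_valid(g)]
```

With the rules exactly as written, this selects A, B, D and F. Student E has only one subject at or under 6, a Portuguese grade of 7 and a Math grade of exactly 6, so E is included only if the Math rule is read as "equal to or under 6". If the expected result of A, B, D, E and F is the intended one, the cond_math comparison would need to be <= rather than <.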

R: How to group and aggregate list elements using regex?

I want to aggregate (sum up) the following product list by groups (see below):
prods <- list("101.2000"=data.frame(1,2,3),
"102.2000"=data.frame(4,5,6),
"103.2000"=data.frame(7,8,9),
"104.2000"=data.frame(1,2,3),
"105.2000"=data.frame(4,5,6),
"106.2000"=data.frame(7,8,9),
"101.2001"=data.frame(1,2,3),
"102.2001"=data.frame(4,5,6),
"103.2001"=data.frame(7,8,9),
"104.2001"=data.frame(1,2,3),
"105.2001"=data.frame(4,5,6),
"106.2001"=data.frame(7,8,9))
test= list("100.2000"=data.frame(2,3,5),
"100.2001"=data.frame(4,5,6))
names <- c("A", "B", "C")
prods <- lapply(prods, function (x) {colnames(x) <- names; return(x)})
Each element of the product list (prods) has a name combination of the product number and the year (e.g. 101.2000 --> 101 = prod nr. and 2000 = year). And the groups only contain product numbers for the aggregation.
group1 <- c(101, 106)
group2 <- c(102, 104)
group3 <- c(105, 103)
My expected result shows the aggregated product groups by year:
$group1.2000
A B C
1 8 10 12
$group2.2000
A B C
1 5 7 9
$group3.2000
A B C
1 11 13 15
$group1.2001
A B C
1 8 10 12
$group2.2001
A B C
1 5 7 9
$group3.2001
A B C
1 11 13 15
So far, I tried this way: First I decomposed the names of prods into product numbers:
prodnames <- names(prods)
prodnames_sub <- gsub("\\..*.","", prodnames)
And then I tried to aggregate using lapply:
lapply(prods, function(x) aggregate( ... , FUN = sum))
However, I couldn't work out how to use the extracted product numbers in the aggregation function. Any ideas? Thanks

Here are two approaches. No packages are used in either one.
1) Using lists. Create a two-column data.frame S from the groups, whose columns are the products (values column) and the associated groups (ind column). Then create By, the list to split by. In the code that produces By, sub("\\..*", "", names(prods)) extracts the product numbers and match is then used to find the associated group; sub(".*\\.", "", names(prods)) extracts the year. Next, perform the split and lapply over it to run the summations. The two components of By (group and year) can be reversed to change the order of the output, if desired.
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
By <- list(group = S$ind[match(sub("\\..*", "", names(prods)), S$values)],
year = sub(".*\\.", "", names(prods)))
lapply(split(prods, By), function(x) colSums(do.call(rbind, x)))
2) Using data.frames. Convert the groups and prods each to a data frame, merge them, perform an aggregate and split the result back into a list. The output is the same as requested except for order. (Reverse the two right-hand variables in the aggregate formula to get the order shown in the question, but that will also reverse the two parts of each component name in the output list.)
S <- stack(list(group1 = group1, group2 = group2, group3 = group3))
DF0 <- do.call(rbind, prods)
DF <- cbind(do.call(rbind, strsplit(rownames(DF0), ".", fixed = TRUE)), DF0)
M <- merge(DF, S, all.x = TRUE, by = 1)
Ag <- aggregate(cbind(A, B, C) ~ ind + `2`, M, sum)
lapply(split(Ag, paste(Ag[[1]], Ag[[2]], sep = ".")), "[", 3:5)
giving:
$group1.2000
A B C
1 8 10 12
$group1.2001
A B C
4 8 10 12
$group2.2000
A B C
2 5 7 9
$group2.2001
A B C
5 5 7 9
$group3.2000
A B C
3 11 13 15
$group3.2001
A B C
6 11 13 15
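
For readers who want to verify the expected sums independently of R, the same group-by-name idea can be sketched in Python: split each name into product number and year, map the product to its group, and add up the columns. The variable names below (group_of, totals) are hypothetical, and each one-row data frame is represented as a plain tuple of its A, B, C values.

```python
from collections import defaultdict

# One (A, B, C) tuple per "product.year" name, as in the question
prods = {
    "101.2000": (1, 2, 3), "102.2000": (4, 5, 6), "103.2000": (7, 8, 9),
    "104.2000": (1, 2, 3), "105.2000": (4, 5, 6), "106.2000": (7, 8, 9),
    "101.2001": (1, 2, 3), "102.2001": (4, 5, 6), "103.2001": (7, 8, 9),
    "104.2001": (1, 2, 3), "105.2001": (4, 5, 6), "106.2001": (7, 8, 9),
}
groups = {"group1": [101, 106], "group2": [102, 104], "group3": [105, 103]}

# Map each product number (as text) to its group name
group_of = {str(p): name for name, members in groups.items() for p in members}

totals = defaultdict(lambda: [0, 0, 0])
for key, values in prods.items():
    # "101.2000" -> product "101", year "2000"
    product, year = key.split(".")
    name = f"{group_of[product]}.{year}"
    for i, v in enumerate(values):
        totals[name][i] += v

totals = dict(totals)
```

Here totals["group1.2000"] is [8, 10, 12], which agrees with the R output above.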
