Filter out outliers dynamically using PERCENTILE - powerbi

I'm building a sales dashboard in PowerBI.
I have a Sales table.
My data is declarative, so it contains a few extreme values caused by human error, typos, etc.
Let's say I want to build a histogram with:
On the X axis, the stock aging of each sale, i.e. how long the product had been in stock at the time of sale. It is given by the [Product_Age] column.
On values, the number of sales.
What I want to do is exclude the top 1% of extreme values from my calculations (average, etc.) and visualizations.
I've created a measure:
SalesByAge_Adjusted =
VAR TEMP =
    FILTER(
        SALES;
        VAR StockAgingMAX =
            PERCENTILE.INC(
                SALES[Sales_Age];
                0,99
            )
        RETURN
            SALES[Sales_Age] < StockAgingMAX
    )
RETURN
    COUNTROWS(TEMP)
It uses PERCENTILE.INC to get the 99th percentile of the Sales_Age values in the current context, and I try to use that as a filter.
However, it doesn't work.
I can display the measure on its own, and it shows how many sales I have. But as soon as I add "Sales_Age" to the visual to break the values down, it shows nothing.
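One plausible reading of this symptom, sketched in Python rather than DAX: with Sales_Age on the axis, each bar's filter context holds a single Sales_Age value, so the percentile evaluated in that context equals that same value and the strict `<` comparison rejects every row. The sample ages and the helper names `percentile_inc` and `count_below_p99` are made up for illustration; the helper uses inclusive linear interpolation, which is the usual definition of an inclusive percentile.

```python
def percentile_inc(values, k):
    """Inclusive percentile with linear interpolation between closest ranks."""
    xs = sorted(values)
    pos = k * (len(xs) - 1)          # zero-based fractional rank
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def count_below_p99(ages_in_context):
    """Mimic the measure: count rows strictly below the p99 of the same context."""
    cutoff = percentile_inc(ages_in_context, 0.99)
    return sum(1 for age in ages_in_context if age < cutoff)


# Hypothetical ages in days; 900 stands in for a mistyped value.
all_ages = [3, 5, 5, 8, 12, 12, 12, 20, 31, 900]

# No axis: the context is the whole table, so only the extreme row is dropped.
card_total = count_below_p99(all_ages)

# Sales_Age on the axis: each bar only sees the rows that share one age.
per_bar = {
    age: count_below_p99([a for a in all_ages if a == age])
    for age in sorted(set(all_ages))
}

print(card_total)
print(per_bar)
```

Every per-bar count comes out as zero here; in DAX, COUNTROWS over an empty table returns blank, which would match the empty visual.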

I have created the following table as an example.
+-------+--------+
| Axis | Values |
+-------+--------+
| 1 | 1067 |
| 2 | 1725 |
| 4 | 298 |
| 8 | 402 |
| 16 | 1848 |
| 32 | 1395 |
| 64 | 1116 |
| 128 | 1027 |
| 256 | 1948 |
| 512 | 790 |
| 1024 | 2173 |
| 2048 | 2025 |
| 4096 | 104 |
| 8192 | 1243 |
| 16384 | 1676 |
| 32768 | 1285 |
| 65536 | 806 |
+-------+--------+
To filter out the values that fall above the 99th percentile, I've created the following measure. It computes the percentile over the whole table, ignoring the filter context, and compares it to each Axis value.
Filter = IF(CALCULATE(PERCENTILE.INC('Table'[Axis],0.99),ALL('Table'))>=MAX('Table'[Axis]),1,0)
In the chart visual, use this measure as a visual-level filter (keep only rows where Filter is 1) to exclude the outliers.
In this case, it will filter out the last value of the table: 65,536.
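The arithmetic behind that result can be checked with a short Python sketch. It assumes the percentile is an inclusive one with linear interpolation; `percentile_inc` is an illustrative helper, not a Power BI function.

```python
def percentile_inc(values, k):
    """Inclusive percentile with linear interpolation between closest ranks."""
    xs = sorted(values)
    pos = k * (len(xs) - 1)          # zero-based fractional rank
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


axis = [2 ** i for i in range(17)]   # 1, 2, 4, ..., 65536 as in the sample table

# Computed once over all rows, like the ALL('Table') part of the measure.
cutoff = percentile_inc(axis, 0.99)

# Same comparison as the measure: 1 keeps the row, 0 marks it as an outlier.
flags = {a: 1 if cutoff >= a else 0 for a in axis}
excluded = [a for a, flag in flags.items() if flag == 0]

print(round(cutoff, 2))  # 60293.12
print(excluded)          # [65536]
```

With 17 rows the 99th percentile falls 84% of the way between 32768 and 65536, so only the largest Axis value is flagged.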

Related

How do I repeat row labels in a matrix?

I have data showing me the dates grouped like this:
I had to remove the Customer Description detail for confidentiality reasons.
How do I repeat the date column the same way you repeat the Row Labels in an Excel Pivot?
I've looked, but couldn't find a solution to this - this option should be available.
EDIT
When you have the following source data in Excel:
Date | Customer | Item Description | Qty Out | Unit Price | Sales
--------------------------------------------------------------------------------------------------------------------------------------------
14/08/2020 | Customer 1 | Item 11 | 4.00 | 65.00 | 260.00
14/08/2020 | Customer 2 | Item 12 | 56.00 | 12.00 | 672.00
14/08/2020 | Customer 3 | Item 13 | 64.00 | 35.00 | 2,240.00
14/08/2020 | Customer 4 | Item 14 | 29.00 | 65.00 | 1,885.00
15/08/2020 | Customer 2 | Item 15 | 746.00 | 12.00 | 8,952.00
15/08/2020 | Customer 3 | Item 16 | 14.00 | 75.00 | 1,050.00
15/08/2020 | Customer 4 | Item 17 | 45.00 | 741.00 | 33,345.00
15/08/2020 | Customer 5 | Item 18 | 456.00 | 125.00 | 57,000.00
15/08/2020 | Customer 6 | Item 19 | 925.00 | 17.00 | 15,725.00
16/08/2020 | Customer 4 | Item 20 | 6.00 | 532.00 | 3,192.00
16/08/2020 | Customer 5 | Item 21 | 56.00 | 94.00 | 5,264.00
16/08/2020 | Customer 6 | Item 22 | 546.00 | 37.00 | 20,202.00
You then pivot this data using Microsoft Excel, where you get the following:
You then choose the option to Repeat Item Labels as can be seen below:
After selecting this, you get my expected results I require in Power BI:
Is there not a function available like this in Power BI?
I'm adding this for reference as a workaround. The image below shows a custom column created in the Power Query Editor:
date_customer = Date.ToText([Date]) &" : "& [Customer]
Then I added both Date and date_customer to the Matrix rows. The output is as below (using your sample data):
ANOTHER OPTION: add Date and Customer to the Matrix rows. The output will be as below (using your sample data):
This is also a meaningful output, as the dates show as group headers. But if you need the date repeated on every row, consider the first option.

Sum where version is highest by another variable (no max version in the whole data)

I'm struggling to get this measure to work.
I would like to have a measure that will sum the Value only for the max version of each house.
So following this example table:
|---------------------|------------------|------------------|
| House_Id | Version_Id | Value |
|---------------------|------------------|------------------|
| 1 | 1 | 1000 |
|---------------------|------------------|------------------|
| 1 | 2 | 2000 |
|---------------------|------------------|------------------|
| 2 | 1 | 3000 |
|---------------------|------------------|------------------|
| 3 | 1 | 5000 |
|---------------------|------------------|------------------|
The result of this measure should be 10000, because House_Id 1 version 1 is ignored: a higher version exists for that house.
By House_Id the result should be:
|---------------------|------------------|
| House_Id | Value |
|---------------------|------------------|
| 1 | 2000 |
|---------------------|------------------|
| 2 | 3000 |
|---------------------|------------------|
| 3 | 5000 |
|---------------------|------------------|
Can anyone help me?
EDIT:
Given the correct answer @RADO gave, I now want to enhance this measure further.
In reality, my main Data table has more columns.
What if I want to add this measure to a table visual that splits it by another column from (or related to) the Data table?
For example (simplified data table):
|---------------------|------------------|------------------|------------------|
| House_Id | Version_Id | Color_Id | Value |
|---------------------|------------------|------------------|------------------|
| 1 | 1 | 1 (Green) | 1000 |
|---------------------|------------------|------------------|------------------|
| 1 | 2 | 2 (Red) | 2000 |
|---------------------|------------------|------------------|------------------|
| 2 | 1 | 1 (Green) | 3000 |
|---------------------|------------------|------------------|------------------|
| 3 | 1 | 1 (Green) | 5000 |
|---------------------|------------------|------------------|------------------|
There's a Color_Id in the main table that is connected to a Color table.
Then I add a visual table with ColorName (from the ColorTable) and the measure (ColorId 1 is Green, 2 is Red).
With the given answer the result is wrong when filtered by ColorName. Although the Total row is indeed correct:
|---------------------|------------------|
| ColorName | Value |
|---------------------|------------------|
| Green | 9000 |
|---------------------|------------------|
| Red | 2000 |
|---------------------|------------------|
| Total | 10000 |
|---------------------|------------------|
This result is wrong per ColorName as 9000 + 2000 is 11000 and not 10000.
The measure should ignore the rows with an old version. In the example before this is the row for House_Id 1 and Color_Id Green because the version is old (there's a newer version for that House_Id).
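The 9000 for Green can be reproduced with a small Python sketch of the same logic. It assumes the measure picks the highest version per house among only the rows visible under the current filter; `sum_latest` is an illustrative name, not part of the model.

```python
rows = [
    # (house_id, version_id, color, value) from the simplified table above
    (1, 1, "Green", 1000),
    (1, 2, "Red", 2000),
    (2, 1, "Green", 3000),
    (3, 1, "Green", 5000),
]


def sum_latest(visible_rows):
    """Sum Value for the highest Version_Id per house among the visible rows."""
    latest = {}
    for house, version, _color, _value in visible_rows:
        latest[house] = max(latest.get(house, version), version)
    return sum(v for h, ver, _c, v in visible_rows if ver == latest[h])


total = sum_latest(rows)
green = sum_latest([r for r in rows if r[2] == "Green"])
red = sum_latest([r for r in rows if r[2] == "Red"])

print(total, green, red)  # 10000 9000 2000
```

Under the Green filter, version 2 of house 1 is not visible, so version 1 looks like the latest one and its 1000 is counted.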
So:
How can I address this situation?
What if I want to filter by another column from (or related to) the Data table, such as Location_Id? Is it possible to define the measure so that it works for any number of split columns from the main Data table?
I use "Data" as the name of your table.
Sum of Latest Values =
VAR Latest_Versions =
    SUMMARIZE ( Data, Data[House_id], "Latest_Version", MAX ( Data[Version_Id] ) )
VAR Latest_Values =
    TREATAS ( Latest_Versions, Data[House_id], Data[Version_Id] )
VAR Result =
    CALCULATE ( SUM ( Data[Value] ), Latest_Values )
RETURN
    Result
Measure output:
How it works:
We calculate a virtual table of house_ids and their max versions, and store it in a variable "Latest_Versions"
We use the table from the first step to filter data for the latest versions only, and establish proper data lineage
(https://www.sqlbi.com/articles/understanding-data-lineage-in-dax/)
We calculate the sum of latest values by filtering data for the latest values only.
You can learn more about this pattern here:
https://www.sqlbi.com/articles/propagate-filters-using-treatas-in-dax/
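The three steps can be mirrored in plain Python as a rough sketch. The second half is a hypothetical variation that is not part of the answer above: it builds the (house, version) pairs from the whole table and only then applies the color split. In DAX that would presumably mean computing the first variable with the color filter removed, which is an assumption here. The function names are illustrative.

```python
data = [
    {"house": 1, "version": 1, "color": "Green", "value": 1000},
    {"house": 1, "version": 2, "color": "Red", "value": 2000},
    {"house": 2, "version": 1, "color": "Green", "value": 3000},
    {"house": 3, "version": 1, "color": "Green", "value": 5000},
]


def latest_pairs(rows):
    """Step 1: (house, max version) pairs, like the Latest_Versions variable."""
    best = {}
    for r in rows:
        best[r["house"]] = max(best.get(r["house"], r["version"]), r["version"])
    return set(best.items())


def sum_of_latest(visible, pairs):
    """Steps 2 and 3: keep rows whose (house, version) is in pairs, then sum."""
    return sum(r["value"] for r in visible if (r["house"], r["version"]) in pairs)


# Total: pairs and rows both come from the whole table.
total = sum_of_latest(data, latest_pairs(data))

# Hypothetical variation: pairs from the whole table, rows from one color.
pairs_all = latest_pairs(data)
by_color = {
    color: sum_of_latest([r for r in data if r["color"] == color], pairs_all)
    for color in ("Green", "Red")
}

print(total)     # 10000
print(by_color)  # {'Green': 8000, 'Red': 2000}
```

With the pairs fixed before the split, the per-color values add up to the total.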

DynamoDB with daily/weekly/monthly aggregated values

My application is creating a log file every 10min, which I want to store in DynamoDB in an aggregated way, e.g. 144 log files per day, 1008 log files per week or ~4400 log files per month.
I have different partition keys, but for sake of simplicity I have used only a single partition key in the following examples.
The straightforward solution would be to have different tables, e.g.
Table "TenMinLogsDay":
id (=part.key) | date (=sort key) | cntTenMinLogs | data
-------------- | ---------------- | ------------- | -------------------------------
1 | 2017-04-30 | 144 | some serialized aggregated data
1 | 2017-05-01 | 144 | some serialized aggregated data
1 | 2017-05-02 | 144 | some serialized aggregated data
1 | 2017-05-03 | 144 | some serialized aggregated data
Table "TenMinLogsWeek":
id (=part.key) | date (=sort key) | cntTenMinLogs | data
-------------- | ---------------- | ------------- | -------------------------------
1 | 2017-05-01 | 1008 | some serialized aggregated data
1 | 2017-05-08 | 1008 | some serialized aggregated data
1 | 2017-05-15 | 1008 | some serialized aggregated data
Table "TenMinLogsMonth":
id (=part.key) | date (=sort key) | cntTenMinLogs | data
-------------- | ---------------- | ------------- | -------------------------------
1 | 2017-05-01 | 4464 | some serialized aggregated data
1 | 2017-06-01 | 4320 | some serialized aggregated data
1 | 2017-07-01 | 4464 | some serialized aggregated data
I would prefer however a combined table. Out of the box DynamoDB does not seem to support this.
Also, I want to query either the daily OR the weekly OR the monthly aggregated items, thus I don't want to use the filter feature for this.
The following solution would be possible, but seems like a poor hack:
Table "TenMinLogsCombined":
id (=part.key) | date (=sort key) | week (=LSI sort key) | month (=LSI sort key) | cntTenMinLogs | data
-------------- | ---------------- | -------------------- | --------------------- | ------------- | -----
1 | 2017-04-30 | (empty) | (empty) | 144 | ...
1 | 2017-05-01 | (empty) | (empty) | 144 | ...
1 | 0017-05-01 | 2017-05-01 | (empty) | 1008 | ...
1 | 1017-05-01 | (empty) | 2017-05-01 | 4464 | ...
1 | 2017-05-02 | (empty) | (empty) | 144 | ...
1 | 2017-05-03 | (empty) | (empty) | 144 | ...
Explanation:
By using the years "0017" and "1017" instead of "2017", I can query a date range such as 2017-05-01 to 2017-05-04, and DynamoDB won't read the items whose sort key starts with 0017 or 1017.
For week or month range queries, such a hack is not required, as empty LSI sort keys are possible.
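The prefix trick relies on string sort keys being compared in plain lexicographic order. A minimal Python sketch of that ordering, using the sort keys from the combined table (the `between` helper is illustrative and only imitates a range condition on the sort key; it does not call DynamoDB):

```python
sort_keys = [
    "2017-04-30",
    "2017-05-01",
    "0017-05-01",  # weekly item, disguised with year 0017
    "1017-05-01",  # monthly item, disguised with year 1017
    "2017-05-02",
    "2017-05-03",
]


def between(keys, low, high):
    """Return keys with low <= key <= high under plain string ordering."""
    return sorted(k for k in keys if low <= k <= high)


daily = between(sort_keys, "2017-05-01", "2017-05-04")
print(daily)  # ['2017-05-01', '2017-05-02', '2017-05-03']
```

Both disguised keys sort before every "2017-..." key, so a daily range query never touches them.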
Does anybody know of a better way to achieve this?

Rescale Dataset using Power BI

I'm trying to rescale a dataset using Power BI Desktop. I've imported a dataset full of raw data, but I can't use row context together with an aggregate. I'm trying to accomplish this:
Data:
+---------+-----+
| Name | Bar |
+---------+-----+
| Alfred | 0 |
| Alfred | -1 |
| Alfred | 1 |
| Burt | 1 |
| Burt | 0 |
| Charlie | 1 |
| Charlie | 1 |
| Charlie | 0 |
+---------+-----+
Calculations:
Foo: = SUM(Bar) / COUNT(Bar) GROUP BY Name
Which would Generate this dataset:
+---------+-----+
| Name | Foo |
+---------+-----+
| Alfred | 0 |
| Burt | .5 |
| Charlie | .67 |
+---------+-----+
Final Calculation:
Score: = (#Foo - MIN(Foo)) / (MAX(Foo)-MIN(Foo))
The goal is to grade on a curve with a set of data. I can do it in Excel, but was hoping that Power BI could handle all the heavy lifting.
At this point it might be easier to do it all in SQL before bringing it into PowerBI, but that would make it significantly less dynamic (with date filters and the like). Thanks for any insight you might have!
I think you're looking for the GROUPBY DAX function: https://support.office.com/en-us/article/GROUPBY-Function-DAX-d6d064b2-fd8b-4c1b-97f8-c6d03cdf8ad0
You would then GROUPBY on the Name field and proceed from there. If you need to use the measure outside of a visual that groups by each Name (for example, to show the average score after applying the curve), you'll need to wrap it in a calculated table that includes the names and your measure projected as a column, and then do your aggregates (min/max/average) over that calculated table.
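For reference, the target numbers for the sample data can be worked out in a few lines of Python. This only illustrates the two-stage calculation (group average, then min-max rescale); it is not a DAX solution, and the variable names are illustrative.

```python
from collections import defaultdict

rows = [
    ("Alfred", 0), ("Alfred", -1), ("Alfred", 1),
    ("Burt", 1), ("Burt", 0),
    ("Charlie", 1), ("Charlie", 1), ("Charlie", 0),
]

# Stage 1: Foo = SUM(Bar) / COUNT(Bar) per Name.
bars = defaultdict(list)
for name, bar in rows:
    bars[name].append(bar)
foo = {name: sum(values) / len(values) for name, values in bars.items()}

# Stage 2: Score = (Foo - MIN(Foo)) / (MAX(Foo) - MIN(Foo)).
# Note: this divides by zero if every name has the same Foo.
lo, hi = min(foo.values()), max(foo.values())
score = {name: (value - lo) / (hi - lo) for name, value in foo.items()}

for name in foo:
    print(name, round(foo[name], 2), round(score[name], 2))
```

For this data the scores are 0 for Alfred, 0.75 for Burt and 1 for Charlie, which gives a reference point when checking a DAX version.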

How can I sum up some values per page in a table in XSL-FO?

I'm using XSL-FO to generate an account statement print out. The PDF is actually just a simple table with a simple header on every page. The difficulty is that I have to display transaction volumes per page, e.g.
Page 1
+------------------------------+-----------+-----------+---------------------+
| Text | Credit | Debit | Balance |
+------------------------------+-----------+-----------+---------------------+
| Previous month | | | (*1) 1000 |
| abc | 1000 | | 2000 |
| abc | | 500 | 1500 |
| abc | | 200 | 1300 |
| ... | | | |
| Carry over | (*2) 1000 | (*3) 700 | (*4) 1300 |
+------------------------------+-----------+-----------+---------------------+
Page 2
+------------------------------+-----------+-----------+---------------------+
| Text | Credit | Debit | Balance |
+------------------------------+-----------+-----------+---------------------+
| Previous page | (*2) 1000 | (*3) 700 | (*4) 1300 |
| abc | 1000 | | 2300 |
| abc | | 500 | 1800 |
| abc | | 200 | 1600 |
| ... | | | |
| Carry over | (*2) 2000 | (*3) 1400 | (*4) 1600 |
+------------------------------+-----------+-----------+---------------------+
Here are some explanations:
(*1) This is the previous month's balance. It's pre-calculated and well-known as an XSL variable. No problem with that, that's a regular header (only on the first page).
(*2) This value is calculated on a per-page basis. It sums up all credit amounts on the same page. I can't calculate that myself, as I don't know when XSL-FO will do the page break. So I imagine XSL-FO must do the calculation for me. The sum at the bottom of a page is the same as the value at the top of the subsequent page.
(*3) This value is the same as (*2), only for debit amounts.
(*4) This value is just the last transaction's balance at the bottom of a page. That value is repeated at the top of the next page.
How can I do these calculations with XSL-FO?
See also this related question: How to display one or the other information depending on the page number in XSL-FO?
Try "table markers": http://www.w3.org/TR/xsl/#fo_retrieve-table-marker.
In XSLT, for each row, inject a marker with the sum. Then let the engine select a marker to substitute for the fo:retrieve-table-marker in the table header or footer. The idea is that the proper marker will be selected at rendering time, depending on the marker's position on the page and on #retrieve-position and #retrieve-boundary on the fo:retrieve-table-marker.
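The values each row's marker would carry are running totals up to and including that row. Here is a Python sketch of just that arithmetic, using the first page of the example; in practice the sums would be computed in XSLT, and the field and function names here are illustrative.

```python
opening_balance = 1000  # (*1) previous month's balance

transactions = [
    {"text": "abc", "credit": 1000, "debit": 0},
    {"text": "abc", "credit": 0, "debit": 500},
    {"text": "abc", "credit": 0, "debit": 200},
]


def running_totals(opening, txs):
    """Return, for each row, the cumulative credit, debit and balance so far."""
    totals = []
    credit = debit = 0
    for tx in txs:
        credit += tx["credit"]
        debit += tx["debit"]
        totals.append({
            "credit": credit,
            "debit": debit,
            "balance": opening + credit - debit,
        })
    return totals


markers = running_totals(opening_balance, transactions)
print(markers[-1])  # {'credit': 1000, 'debit': 700, 'balance': 1300}
```

Whichever row ends up last on a page, its marker already holds the "Carry over" values for that page, so the stylesheet never needs to know where the page break falls.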
Unfortunately, at the time I answered this question, FOP didn't implement <fo:retrieve-table-marker/> as far as I could find (this is no longer true). Instead, this solution worked for me:
How to display one or the other information depending on the page number in XSL-FO?
It involves creating a separate table outside of the <fo:flow/> that displays the table header using <fo:retrieve-marker/> elements.