How to do column wise intersection with itertools - python-2.7

When I calculate the Jaccard similarity between each pair of my m training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), to form an (m*m) similarity matrix, I get an unexpected result.
I have identified the source of the problem but do not have an optimized solution for it.
A sample of the dataset is below:
ID    AGE    Occupation  Gender  Product_range  Product_cat  Product
1100  25-34  IT          M       50-60          Gaming       XPS 6610
1101  35-44  Research    M       60-70          Business     Latitude lat6
1102  35-44  Research    M       60-70          Performance  Inspiron 5810
1103  25-34  Lawyer      F       50-60          Business     Latitude lat5
1104  45-54  Business    F       40-50          Performance  Inspiron 5410
The matrix I get is
Problem Statement:
The value in the red box shows the similarity of rows (1104) and (1101) of the sample dataset. The two rows have nothing in common when compared column by column; however, the value 0.16 arises because the term "Business" appears in the "Occupation" column of row (1104) and in the "Product_cat" column of row (1101), so the intersection of the two rows has size 1.
My code just takes the intersection of the two rows without looking at the columns. How do I change my code to handle this case while keeping the performance equally good?
My code:
half_matrix = []
for row1, row2 in itertools.combinations(data_set, r=2):
    intersection_len = row1.intersection(row2)
    half_matrix.append(float(len(intersection_len)) / tot_len)

The simplest way out of this is to add a column-specific prefix to all entries. Example of a parsed row:
row = ["ID:1100", "AGE:25-34", "Occupation:IT", "Gender:M", "Product_range:50-60", "Product_cat:Gaming", "Product:XPS 6610"]
There are many other ways around this, including splitting each row into a set of k-mers and applying the Jaccard-based MinHash algorithm to compare these sets, but there is no need for that in your case.
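A minimal sketch of the prefixing idea, using the sample rows from the question. The names `columns`, `rows` and `tag_row` are made up for illustration, the ID column is left out, and the score keeps the question's own definition (intersection size divided by the number of features) rather than intersection over union:

```python
import itertools

# Feature columns from the question (ID is omitted here by assumption).
columns = ["AGE", "Occupation", "Gender", "Product_range", "Product_cat", "Product"]
rows = [
    ["25-34", "IT", "M", "50-60", "Gaming", "XPS 6610"],            # 1100
    ["35-44", "Research", "M", "60-70", "Business", "Latitude lat6"],   # 1101
    ["35-44", "Research", "M", "60-70", "Performance", "Inspiron 5810"],  # 1102
    ["25-34", "Lawyer", "F", "50-60", "Business", "Latitude lat5"],     # 1103
    ["45-54", "Business", "F", "40-50", "Performance", "Inspiron 5410"],  # 1104
]

def tag_row(row):
    """Prefix every value with its column name so equal strings in
    different columns no longer collide in the set intersection."""
    return frozenset("%s:%s" % (col, val) for col, val in zip(columns, row))

data_set = [tag_row(r) for r in rows]
tot_len = float(len(columns))

# Same loop as in the question, now operating on the tagged sets.
half_matrix = [
    len(row1 & row2) / tot_len
    for row1, row2 in itertools.combinations(data_set, 2)
]

# Untagged version, to show the collision described in the question.
untagged = [frozenset(r) for r in rows]
collision = len(untagged[1] & untagged[4]) / tot_len  # rows 1101 and 1104
```

The tagging is done once per row, so the pairwise loop itself costs the same as before.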

Related

Remove single column value from row total value in a PowerBi Matrix

I have a matrix that has numerous categories/columns with a total of each row. I want a specific column to remain in the matrix, but should not form part of the row total.
Dog  Cat  Chicken  Total (Excl. Chicken)
2    2    10       4
2    4    100      6
I can only find ways to remove either the row totals or the column totals entirely, not to exclude specific columns from them. It also appears that measures do not work in matrices. Do I need to use a table with an added measure instead, or is there a way to do this with a matrix?

How to calculate ratio for every row in power bi?

Could someone please guide me through this situation?
I need to compute, for every row, the ratio between the volume of the row and the sum of the volumes of all rows with the same transport type.
After computing this as a calculated column, I would like to use it in a table visualization and have it respond to any slicer.
Was I clear?
Here follows the example of rows:
Delivery Method  Order Nr  VOLUME  Ratio
Air              50102258  9       33%
Sea              50091716  50      52%
Sea              50092425  47      48%
Air              50102257  18      67%
Here's your measure (a calculated column doesn't work here since you have to aggregate per method):
% Volume =
DIVIDE(
    SUM('Table'[VOLUME]),
    CALCULATE(
        SUM('Table'[VOLUME]),
        ALLEXCEPT('Table', 'Table'[Delivery Method])
    )
)
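To check the expected percentages outside Power BI, the same arithmetic can be sketched in plain Python: each row's volume divided by the total volume of its delivery method. The variable names are illustrative only, and this does not model slicers or filter context:

```python
from collections import defaultdict

# (delivery method, order number, volume) rows from the question.
rows = [
    ("Air", 50102258, 9),
    ("Sea", 50091716, 50),
    ("Sea", 50092425, 47),
    ("Air", 50102257, 18),
]

# Total volume per delivery method (the denominator of the ratio).
totals = defaultdict(int)
for method, _order, volume in rows:
    totals[method] += volume

# Ratio of each order's volume to its method's total.
ratios = {
    order: float(volume) / totals[method]
    for method, order, volume in rows
}

for method, order, volume in rows:
    print("%s %d %d %.0f%%" % (method, order, volume, 100 * ratios[order]))
```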
If the answer was useful please accept it and give it an upvote.

Grand averaging a measure where there is missing data in PowerBI and DAX

I am trying to get my head around DAX and am struggling. I have a PowerBI Matrix in which I need to calculate the average of a measure. The measure is '% of population' and on the surface it appears to work exactly as expected.
It calculates correctly in the top matrix for the two levels and also summarises correctly in the bottom table.
As an example, I have highlighted in red the order of calculations for "A3"
For the record the % population is set to
% of Population = sum(Data[Value])/sum('Level'[Population])
The problem occurs when I filter on the Country and only select Country 2...
Country 2 does not have data for "D13". Although the Values sum up correctly (170), the Sum of the Population includes the 300 from the missing D13 row making a total of 600 and the '% population' of 28.33% (instead of 170 / 300 = 57%)
I am happy to turn off the group totals in the top grid so that the 28.33 does not show; so my real problem is actually with the bottom grid.
I think I need a new measure to be displayed in the bottom grid. I think it simply needs to sum up the values and divide by the sum of the populations - but only when the value is present. How do I do that?
Or am I totally on the wrong track and there is an obvious answer that I am missing?
The PowerBI file can be downloaded from here
Thanks in advance.
This happens because the Country table does not filter the Level table: in the relationship diagram, each of them filters only one way, into the Data table, and there are no other relationships.
Without changing your data model, one way to fix this in DAX is to specify that you only want to count Population where Level[LevelId] matches a Data[SecondLevelId] in your current filter context.
Population =
DIVIDE (
    SUM ( Data[Value] ),
    CALCULATE (
        SUM ( 'Level'[Population] ),
        'Level'[LevelId] IN VALUES ( Data[SecondLevelId] )
    )
)
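The effect of restricting the population to levels that actually have data can be sketched in Python. The per-level numbers below are hypothetical, chosen only to reproduce the totals quoted in the question (values summing to 170, populations summing to 600 with the missing "D13" row contributing 300); the level names "D11" and "D12" are also assumptions:

```python
# Hypothetical Level table: level id -> population.
population = {"D11": 100, "D12": 200, "D13": 300}

# Hypothetical Data rows for Country 2: (level id, value). No row for D13.
data_country2 = [("D11", 70), ("D12", 100)]

total_value = sum(value for _level, value in data_country2)

# Unfiltered denominator: every level counts, including the missing D13.
naive_pct = total_value / float(sum(population.values()))

# Filtered denominator: only levels that appear in the data rows,
# mirroring the IN VALUES(...) condition in the measure above.
levels_present = {level for level, _value in data_country2}
filtered_pct = total_value / float(sum(population[l] for l in levels_present))

print("naive: %.2f%%, filtered: %.2f%%" % (100 * naive_pct, 100 * filtered_pct))
```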

DAX suggestions to generate specific weighted average measure in PowerBI?

I'm trying to formulate weighted average measures in PowerBi for survey data. I've currently got the following two tables (simplified):
Survey Data
ID  How would you score the service?  Do you agree with X?
------------------------------------------------------------
23  Fair                              Agree
24  Poor                              Strongly disagree
25  Fair                              Agree
26  Very poor                         *blank*
27  Very good                         Strongly agree
Weights
Weight  Score      Likert
-------------------------------------------
1       Very poor  Strongly disagree
2       Poor       Disagree
3       Fair       Neither agree nor disagree
4       Good       Agree
5       Very good  Strongly Agree
There's currently a relationship between 'surveydata'[How would you score the service?] and 'weights'[weight].
The weighted average formula I'm trying to calculate is the following:
Weighted Ave = ((x1 * w1) + (x2 * w2) + ... + (xn * wn)) / (x1 + x2 + ... + xn)
Where:
w = weight of answer choice
x = response count for answer choice
For the example above, I would need two measures - a weighted average for 'surveydata'[How would you score the service?] and one for 'surveydata'[Do you agree with X?].
For the example above, the weighted average measure for 'surveydata'[How would you score the service?] should be 2.8.
Note that one of the complications is the fact that there are *blank* cells.
Can anyone suggest a way of doing this in PowerBI with calculated measures (or otherwise)?
You can sum the weights for each row in your SurveyData table and divide by the number of rows
Weighted Ave =
DIVIDE(
    SUMX(SurveyData, RELATED(Weights[Weight])),
    COUNTROWS(SurveyData)
)
Or simply use an average function
Weighted Ave = AVERAGEX(SurveyData, RELATED(Weights[Weight]))
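As a cross-check of the arithmetic, here is a small Python sketch over the sample rows. Averaging the weight of each individual response is the same as the count-weighted formula in the question. The helper name `weighted_average` is made up, matching is done case-insensitively (the sample has "Strongly agree" against "Strongly Agree"), and blank answers are skipped, which is an assumption about how blanks should be treated:

```python
# Weight lookups built from the Weights table, keyed in lower case.
score_weight = {"very poor": 1, "poor": 2, "fair": 3, "good": 4, "very good": 5}
likert_weight = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "strongly agree": 5,
}

# (ID, score answer, agreement answer); None stands for *blank*.
survey = [
    (23, "Fair", "Agree"),
    (24, "Poor", "Strongly disagree"),
    (25, "Fair", "Agree"),
    (26, "Very poor", None),
    (27, "Very good", "Strongly agree"),
]

def weighted_average(answers, weight_of):
    """Mean weight over the non-blank answers; None if all are blank."""
    mapped = [weight_of[a.lower()] for a in answers if a]
    return sum(mapped) / float(len(mapped)) if mapped else None

score_avg = weighted_average([row[1] for row in survey], score_weight)
agree_avg = weighted_average([row[2] for row in survey], likert_weight)
print(score_avg, agree_avg)
```

For the score column this gives the 2.8 expected in the question. For the agreement column the result depends on whether the blank row is counted in the denominator, so it is worth checking which behaviour you want.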

How do I get my DAX measure to calculate grouped values?

I need help with a DAX measure in my Power BI report. I am just learning about measures, so sorry if this seems like a newbie question.
Here’s my scenario:
The purpose of the report is to show water usage for various locations in a municipality. The data is coming from IOT sensors, with new entries every few minutes or so.
A sample of the data generally looks like this:
|SensorID |Timestamp |Reading
|----------|---------------------|--------
|1 |2017-06-22 12:01 AM |123.45
|1 |2017-06-22 12:15 AM |124.56
|1 |2017-06-22 12:36 AM |128.38
|2 |2017-06-22 02:12 AM |564.75
|2 |2017-06-22 02:43 AM |581.97
I have a simple DAX measure that I use to calculate water usage for each location/sensor, for the current date range selected in the report (via Timeline control):
Usage:= (MAX([Reading]) - MIN([Reading]))
This measure works great when a single location/sensor is selected. However, when I select multiple locations, the calculation is incorrect. It takes the MAX value from ALL sensors, and subtracts the MIN value from ALL sensors - rather than calculating the usage from each location and then summing the usage.
For example, given the data sample above, the correct calculation should be:
Total Usage = (128.38 - 123.45) + (581.97 - 564.75) = 22.15
Instead, it is calculating it this way:
Total Usage = (581.97 - 123.45) = 458.52
How can I get the measure to calculate the usage, grouped by the Sensor ID?
I hope this makes sense.
Thanks!
Try this:
Total Usage:= SUMX( VALUES(MyTable[SensorID]), [Usage])
The VALUES(MyTable[SensorID]) function returns the list of unique SensorIDs in the current filter context. SUMX then goes over that list one by one, evaluates your [Usage] measure for each SensorID, and sums up the results.
An alternative solution:
Total Usage:= SUMX( SUMMARIZE(MyTable, MyTable[SensorID]), [Usage])
It works the same way, only the list of unique sensor ids is returned by SUMMARIZE function instead of VALUES.
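The per-sensor grouping can be sketched in Python with the sample readings from the question (timestamps omitted, variable names illustrative): compute max minus min for each sensor, then sum, and compare with the ungrouped max minus min:

```python
# (SensorID, Reading) pairs from the sample data.
readings = [
    (1, 123.45),
    (1, 124.56),
    (1, 128.38),
    (2, 564.75),
    (2, 581.97),
]

# Track the (min, max) reading seen so far for each sensor.
by_sensor = {}
for sensor_id, reading in readings:
    lo, hi = by_sensor.get(sensor_id, (reading, reading))
    by_sensor[sensor_id] = (min(lo, reading), max(hi, reading))

# Grouped calculation: usage per sensor, then summed.
total_usage = sum(hi - lo for lo, hi in by_sensor.values())

# Ungrouped calculation: what a plain MAX - MIN over all rows gives.
all_values = [reading for _sensor_id, reading in readings]
naive_usage = max(all_values) - min(all_values)

print("grouped: %.2f, ungrouped: %.2f" % (total_usage, naive_usage))
```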