Transform categorical column into dummy columns using Power Query M - powerbi

Using the Power Query "M" language, how would you transform a categorical column containing discrete values into multiple "dummy" columns? I come from the Python world, where there are several ways to do this; one of them is shown below:
>>> import pandas as pd
>>> dataset = pd.DataFrame(list('ABCDACDEAABADDA'),
...                        columns=['my_col'])
>>> dataset
my_col
0 A
1 B
2 C
3 D
4 A
5 C
6 D
7 E
8 A
9 A
10 B
11 A
12 D
13 D
14 A
>>> pd.get_dummies(dataset)
my_col_A my_col_B my_col_C my_col_D my_col_E
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 1 0 0
6 0 0 0 1 0
7 0 0 0 0 1
8 1 0 0 0 0
9 1 0 0 0 0
10 0 1 0 0 0
11 1 0 0 0 0
12 0 0 0 1 0
13 0 0 0 1 0
14 1 0 0 0 0

Interesting question. Here's an easy, scalable method I've found:
1. Add a custom column of all ones (Add Column > Custom Column > Formula = 1).
2. Add an index column (Add Column > Index Column), so each original row stays separate when pivoting.
3. Pivot my_col, using the custom column as the values column (select my_col > Transform > Pivot Column).
4. Replace null values with 0 (select all columns > Transform > Replace Values).
Here's what the M code looks like for this process:
#"Added Custom" = Table.AddColumn(#"Previous Step", "Custom", each 1),
#"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 0, 1),
#"Pivoted Column" = Table.Pivot(#"Added Index", List.Distinct(#"Added Index"[my_col]), "my_col", "Custom"),
#"Replaced Value" = Table.ReplaceValue(#"Pivoted Column",null,0,Replacer.ReplaceValue,Table.ColumnNames(#"Pivoted Column"))
Once you've completed the above, you can remove the index column if desired.
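Since the question comes from a pandas background, the four steps above can be mirrored in pandas as a rough cross-check. This is only a sketch of the same idea, not part of the M solution; the names staged, pivoted and dummies are made up for illustration.

```python
import pandas as pd

dataset = pd.DataFrame(list("ABCDACDEAABADDA"), columns=["my_col"])

# Steps 1 and 2: a column of ones plus an index column.
staged = dataset.assign(Custom=1, Index=range(len(dataset)))

# Step 3: pivot my_col into columns, taking the cell values from Custom.
pivoted = staged.pivot(index="Index", columns="my_col", values="Custom")

# Step 4: cells with no match come back as NaN, so replace them with 0.
dummies = pivoted.fillna(0).astype(int)

print(dummies.head())
```

Each row ends up with exactly one 1, which is the same shape pd.get_dummies gives in the question.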

Related

How to display missing dates in the data table Power BI

My data is about vehicle movement. It currently only contains rows for the dates on which a car moved; if the car did not move, no row is stored for that date. For example, this table shows the car moving on some dates, while the dates in between are missing:
Date Count ASSET_ID Mileage
*****************************************************
1/7/2021 1 200
4/7/2021 1 32
18/7/2021 1 100
After the modification, I would like to display all dates, with zeros stored for the dates on which the car did not move, so that I can count how many days the car did not move in a month. Here is an example of the table I want:
Date Count ASSET_ID Mileage
****************************************************
1/7/2021 1 200
2/7/2021 0 0
3/7/2021 0 0
4/7/2021 1 32
5/7/2021 0 0
6/7/2021 0 0
7/7/2021 0 0
8/7/2021 0 0
9/7/2021 0 0
10/7/2021 0 0
11/7/2021 0 0
12/7/2021 0 0
13/7/2021 0 0
14/7/2021 0 0
15/7/2021 0 0
16/7/2021 0 0
17/7/2021 0 0
18/7/2021 1 100
19/7/2021 0 0
20/7/2021 0 0
21/7/2021 0 0
22/7/2021 0 0
23/7/2021 0 0
24/7/2021 0 0
25/7/2021 0 0
26/7/2021 0 0
27/7/2021 0 0
28/7/2021 0 0
29/7/2021 0 0
30/7/2021 0 0
31/7/2021 0 0
Adding a date table would help you tremendously here.
You can use the following one (or edit it as you please):
dimDate = ADDCOLUMNS (
CALENDAR (DATE(2021,1,1),DATE(2025,12,31)),
"DateNumber", FORMAT([Date],"YYYYMMDD"),
"Year", YEAR([Date]),
"MonthNo",FORMAT([Date],"MM"),
"YearMonthNo", FORMAT([Date], "YYYY/MM"),
"YearMonthShort", FORMAT([Date], "YYYY/mmm"),
"MonthShort", FORMAT([Date], "mmm"),
"Month", FORMAT([Date], "mmmm"),
"DayNo", WEEKDAY([Date]),
"Day", FORMAT([Date], "dddd"),
"DayShort", FORMAT([Date], "ddd"),
"Quarter", FORMAT([Date], "Q"),
"YearQuarter", FORMAT([Date], "YYYY") & "/Q" & FORMAT([Date], "Q"))
After adding the date table and connecting it in your data model, you can use measures to calculate what days the vehicle moved / did not move.
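To make the idea concrete outside of DAX, here is a small pandas sketch of the same "full calendar plus zeros" logic. It is not the Power BI solution itself: the frame name moves, the July 2021 range, and the assumption that the dates are day/month/year are all taken from the sample rows in the question.

```python
import pandas as pd

# Sample rows from the question (dates assumed to be day/month/year).
moves = pd.DataFrame({
    "Date": pd.to_datetime(["1/7/2021", "4/7/2021", "18/7/2021"], format="%d/%m/%Y"),
    "Count": [1, 1, 1],
    "Mileage": [200, 32, 100],
})

# The "date table": every day of the month, moved or not.
all_days = pd.date_range("2021-07-01", "2021-07-31", freq="D")

# Align the data to the full calendar; days without a row get zeros.
filled = (
    moves.set_index("Date")
    .reindex(all_days, fill_value=0)
    .rename_axis("Date")
    .reset_index()
)

days_not_moved = int((filled["Count"] == 0).sum())
print(days_not_moved)
```

In Power BI the date table plays the role of all_days, and a measure over the relationship does the counting.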

Power BI DAX - Grouping rows when a value is found in row

I have the table below. I need to group the rows based on product, incrementing the group number after each row where Set = 1 and resetting it to 1 when a new product starts on the next line. I have already created an index.
Index  Product  Set
1      Table    0
2      Table    0
3      Table    1
4      Table    0
5      Table    0
6      Table    1
7      Table    0
8      Table    1
9      Chair    0
10     Chair    0
11     Chair    0
12     Chair    1
13     Chair    0
14     Chair    0
15     Chair    1
Here's the result I'm after:
Index  Product  Set  Group
1      Table    0    1
2      Table    0    1
3      Table    1    1
4      Table    0    2
5      Table    0    2
6      Table    1    2
7      Table    0    3
8      Table    1    3
9      Chair    0    1
10     Chair    0    1
11     Chair    0    1
12     Chair    1    1
13     Chair    0    2
14     Chair    0    2
15     Chair    1    2
Try this as a calculated column:
Grouping =
RANKX (
    FILTER (
        'fact',
        'fact'[Set] <> 0
            && EARLIER ( 'fact'[Product] ) = 'fact'[Product]
    ),
    'fact'[Index],
    ,
    ASC
)
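To check the expected numbering independently of DAX, the rule from the question can be written as a plain Python loop. This is a sketch of the rule, not a translation of RANKX; the function name assign_groups and the (product, set_flag) tuple layout are made up for the example, and rows are assumed to be in Index order.

```python
def assign_groups(rows):
    """rows: (product, set_flag) tuples in Index order; returns one group number per row."""
    groups = []
    previous_product = None
    current = 1
    start_new = False  # True when the previous row had Set = 1
    for product, set_flag in rows:
        if product != previous_product:
            current = 1          # a new product restarts the numbering
        elif start_new:
            current += 1         # the row after a Set = 1 row opens the next group
        groups.append(current)
        start_new = set_flag == 1
        previous_product = product
    return groups


rows = [("Table", s) for s in (0, 0, 1, 0, 0, 1, 0, 1)] + \
       [("Chair", s) for s in (0, 0, 0, 1, 0, 0, 1)]
print(assign_groups(rows))
```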

SAS code - sum of last N rows for every row

I have a dataset like this for each ID:
Months      ID  Number
2018-07-01  1   0
2018-08-01  1   0
2018-09-01  1   1
2018-10-01  1   3
2018-11-01  1   1
2018-12-01  1   2
2019-01-01  1   0
2019-02-01  1   0
2019-03-01  1   1
2019-04-01  1   0
2019-05-01  1   0
2019-06-01  1   0
2019-07-01  1   1
2019-08-01  1   0
2019-09-01  1   0
2019-10-01  1   2
2019-11-01  1   0
2019-12-01  1   0
2020-01-01  1   0
2020-02-01  1   0
2020-03-01  1   0
2020-04-01  1   0
2020-05-01  1   0
2020-06-01  1   0
2020-07-01  1   0
2020-08-01  1   1
2020-09-01  1   0
2020-10-01  1   0
2020-11-01  1   1
2020-12-01  1   0
2021-01-01  1   0
2021-02-01  1   1
2021-03-01  1   1
2021-04-01  1   0
2018-07-01  2   0
.......     .......  .......
(Similar values for each ID)
I want a dataset like this:
Months      ID  Number  Sum_Next_6Number
2018-07-01  1   0       7
2018-08-01  1   0       7
2018-09-01  1   1       7
2018-10-01  1   3       4
2018-11-01  1   1       3
2018-12-01  1   2       1
2019-01-01  1   0       2
2019-02-01  1   0       2
2019-03-01  1   1       1
2019-04-01  1   0       3
2019-05-01  1   0       3
2019-06-01  1   0       3
2019-07-01  1   1       2
2019-08-01  1   0       2
2019-09-01  1   0       2
2019-10-01  1   2       0
2019-11-01  1   0       0
2019-12-01  1   0       0
2020-01-01  1   0       0
2020-02-01  1   0       1
2020-03-01  1   0       1
2020-04-01  1   0       1
2020-05-01  1   0       2
2020-06-01  1   0       2
2020-07-01  1   0       2
2020-08-01  1   1       2
2020-09-01  1   0       3
2020-10-01  1   0       3
2020-11-01  1   1       Nan
2020-12-01  1   0       Nan
2021-01-01  1   0       Nan
2021-02-01  1   1       Nan
2021-03-01  1   1       Nan
2021-04-01  1   0       Nan
2018-07-01  2   0       0
.......     .......  .......  .......
If fewer than 6 months are left for an ID, the value should be Nan.
Is there a way to do this? Thank you in advance.
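data want(drop = i n);
    set have curobs = c nobs = nobs;      /* c = current row number, nobs = total rows */
    Sum_Next_6Numbers = 0;
    do p = c + 1 to 6 + c;                /* look ahead at the next 6 rows */
        if p > nobs then do;              /* ran past the end of the data */
            Sum_Next_6Numbers = .; leave;
        end;
        set have(keep = Number ID rename = (Number = n id = i)) point = p;
        if id ne i then do;               /* look-ahead row belongs to another ID */
            Sum_Next_6Numbers = .; leave;
        end;
        Sum_Next_6Numbers + n;            /* sum statement: accumulate the look-ahead value */
    end;
run;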
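For readers less familiar with point= look-ahead, the same logic can be sketched in plain Python. This is an illustration of what the data step does, not a replacement for it; the function name sum_next_n and the (id, number) tuple layout are invented for the example, and None stands in for the SAS missing value.

```python
def sum_next_n(rows, n=6):
    """rows: (id, number) tuples in order; returns the sum of the next n numbers
    for the same id, or None when fewer than n rows of that id remain."""
    result = []
    for i, (row_id, _) in enumerate(rows):
        window = rows[i + 1 : i + 1 + n]  # the next n rows, excluding the current one
        ran_out = len(window) < n
        crossed_id = any(other_id != row_id for other_id, _ in window)
        if ran_out or crossed_id:
            result.append(None)
        else:
            result.append(sum(number for _, number in window))
    return result


# First ten months of ID 1 from the question, followed by two rows of a second ID.
rows = [(1, v) for v in (0, 0, 1, 3, 1, 2, 0, 0, 1, 0)] + [(2, 5), (2, 5)]
print(sum_next_n(rows))
```

With only ten rows for ID 1, the first four sums match the question's table and the rest are None because fewer than six rows remain.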

Create boolean dataframe showing existence of each element in a dictionary of lists

I have a dictionary of lists, and I have constructed a dataframe where the index is the dictionary keys and the columns are the set of possible values contained in the lists. The dataframe values represent the existence of each column value in each list of the dictionary. What is the most efficient way to construct this? Below is how I do it now using for loops, but I am sure there is a more efficient way using either vectorization or concatenation.
import pandas as pd
data = {0:[1,2,3,4],1:[2,3,4],2:[3,4,5,6]}
cols = sorted(list(set([x for y in data.values() for x in y])))
df = pd.DataFrame(0,index=data.keys(),columns=cols)
for row in df.iterrows():
    for col in cols:
        if col in data[row[0]]:
            df.loc[row[0],col] = 1
        else:
            df.loc[row[0],col] = 0
print(df)
Output:
1 2 3 4 5 6
0 1 1 1 1 0 0
1 0 1 1 1 0 0
2 0 0 1 1 1 1
Use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(data.values()),
                  columns=mlb.classes_,
                  index=data.keys())
print(df)
1 2 3 4 5 6
0 1 1 1 1 0 0
1 0 1 1 1 0 0
2 0 0 1 1 1 1
A pure pandas, but much slower, solution with str.get_dummies:
df = pd.Series(data).astype(str).str.strip('[]').str.get_dummies(', ')
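If you want to stay in pandas without going through strings, one more variant is to explode the lists and collapse the dummies back per key. This is a sketch that was not part of the answer above and has not been benchmarked against it; it assumes a pandas version that has Series.explode, and the names exploded and result are only for illustration.

```python
import pandas as pd

data = {0: [1, 2, 3, 4], 1: [2, 3, 4], 2: [3, 4, 5, 6]}

# One row per (key, element) pair; the dictionary key is repeated in the index.
exploded = pd.Series(data).explode()

# One-hot encode the elements, then collapse back to one row per key.
result = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

print(result)
```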

Convert this Word DataFrame into Zero One Matrix Format DataFrame in Python Pandas

I want to convert a DataFrame of user_Id and skills into a zero-one matrix DataFrame showing each user and their corresponding skills.
Input DataFrame
user_Id skills
0 user1 [java, hdfs, hadoop]
1 user2 [python, c++, c]
2 user3 [hadoop, java, hdfs]
3 user4 [html, java, php]
4 user5 [hadoop, php, hdfs]
Desired Output DataFrame
user_Id java c c++ hadoop hdfs python html php
user1 1 0 0 1 1 0 0 0
user2 0 1 1 0 0 1 0 0
user3 1 0 0 1 1 0 0 0
user4 1 0 0 0 0 0 1 1
user5 0 0 0 1 1 0 0 1
You can join a new DataFrame built from the skills column: convert the lists to str with astype if needed (otherwise omit it), remove the [] with strip, and use get_dummies:
df = df[['user_Id']].join(df['skills'].astype(str).str.strip('[]').str.get_dummies(', '))
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
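Or, starting from the original df, if the column names need cleaning:
df1 = df['skills'].astype(str).str.strip('[]').str.get_dummies(', ')
# if necessary, remove ' from the column names
df1.columns = df1.columns.str.strip("'")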
df1.columns = df1.columns.str.strip("'")
df = pd.concat([df['user_Id'], df1], axis=1)
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
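If the skills column holds real Python lists rather than strings, a variant that avoids the str/strip round trip is to join each list with a separator first. This is a sketch under two assumptions: the column really contains lists, and no skill name contains the "|" character. The names flags and result are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "user_Id": ["user1", "user2", "user3", "user4", "user5"],
    "skills": [
        ["java", "hdfs", "hadoop"],
        ["python", "c++", "c"],
        ["hadoop", "java", "hdfs"],
        ["html", "java", "php"],
        ["hadoop", "php", "hdfs"],
    ],
})

# Join each list into one "|"-separated string, then split it into 0/1 columns.
flags = df["skills"].str.join("|").str.get_dummies(sep="|")

result = df[["user_Id"]].join(flags)
print(result)
```

Because the lists are joined directly, no quotes end up in the column names, so the extra strip("'") step is not needed here.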