Power BI - Power Query Editor: Remove All Duplicates (Don't leave any rows that were part of the duplicate) - powerbi

So, I know how to remove duplicates which leave one row behind. What I want to do is remove all of the rows associated with a duplicate, because we don't know which of the duplicates we want to keep, and for our purposes therefore don't want any of them in our table. There are only two columns. One column contains the duplicates. The second has unique values per duplicate, but we don't want any of them to remain.
Thank you.

Here is a possible workaround. Use Table.Group to count how many times each row occurs, then use Table.SelectRows to keep only the rows that occur exactly once.
let
    Source = Table.FromRecords({
        [a = "A", b = "a"], // < duplicated
        [a = "B", b = "a"],
        [a = "A", b = "a"]  // < duplicated
    })
in
    Table.SelectRows(
        Table.Group(Source, {"a", "b"}, {"Count", Table.RowCount}),
        each [Count] = 1
    )
/*
 * Output
 *
 * a    b    Count
 * ---  ---  -----
 * B    a    1
 */
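The count-then-filter idea can also be sketched outside Power Query. The Python below is only an illustration of the logic, not part of the original answer: the function name `drop_all_duplicates` and the sample rows are hypothetical, and the key here is a single column, which is an assumption based on the question (duplicates live in one column, the other column is unique per row).

```python
from collections import Counter

def drop_all_duplicates(rows, key):
    """Return only the rows whose key value occurs exactly once."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] == 1]

# Hypothetical sample: "A" appears twice in column "a", so both of its rows are dropped.
rows = [
    {"a": "A", "b": "x"},
    {"a": "B", "b": "y"},
    {"a": "A", "b": "z"},
]
result = drop_all_duplicates(rows, "a")
print(result)
```

Counting first and filtering second means no row of a duplicated key survives, which is the difference from an ordinary "remove duplicates" step that keeps one row.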

Related

EXPAND MULTIPLE COLUMNS POWER BI

I've been struggling with this:
My table shows 3 records but when expanding there are like 100 columns. I used this code:
#"Expanded Data" = Table.ExpandTableColumn(#"Source", "Document", List.Union(List.Transform(#"Source"[Document]), each Table.ColumnNames(_))),
but it's not working. How can I expand all columns simultaneously? Also, inside those columns there are even more: for example, I expand the first time and then those new columns have more records inside.
What could I do? Thanks in advance!
Try this ExpandAllRecords function - it recursively expands every Record-type column:
https://gist.github.com/Mike-Honey/0a252edf66c3c486b69b
This should work for record columns.
let
    ExpandIt = (TableToExpand as table, optional ColumnName as text) =>
        let
            ListAllColumns = Table.ColumnNames(TableToExpand),
            ColumnsTotal = Table.ColumnCount(TableToExpand),
            CurrentColumnIndex = if (ColumnName = null) then 0 else List.PositionOf(ListAllColumns, ColumnName),
            CurrentColumnName = ListAllColumns{CurrentColumnIndex},
            CurrentColumnContent = Table.Column(TableToExpand, CurrentColumnName),
            IsExpandable = if List.IsEmpty(List.Distinct(List.Select(CurrentColumnContent, each _ is record))) then false else true,
            FieldsToExpand = if IsExpandable then Record.FieldNames(List.First(List.Select(CurrentColumnContent, each _ is record))) else {},
            ColumnNewNames = List.Transform(FieldsToExpand, each CurrentColumnName &"."& _),
            ExpandedTable = if IsExpandable then Table.ExpandRecordColumn(TableToExpand, CurrentColumnName, FieldsToExpand, ColumnNewNames) else TableToExpand,
            NextColumnIndex = CurrentColumnIndex+1,
            NextColumnName = ListAllColumns{NextColumnIndex},
            // "@" lets the function refer to itself, so it can recurse to the next column
            OutputTable = if NextColumnIndex > ColumnsTotal-1 then ExpandedTable else @ExpandIt(ExpandedTable, NextColumnName)
        in
            OutputTable
in
    ExpandIt
The function takes the table to transform as its main argument, then checks the columns one by one: if a column contains records it is expanded, otherwise the function moves on to the next column and checks again.
Once every column has been checked, it returns the output table.
The function calls itself once per column, so each iteration is a recursive call.
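To make the recursion easier to see, here is a minimal sketch of the same idea in Python, using nested dictionaries as stand-ins for record values. It is an illustration only: the name `expand_records` and the sample row are hypothetical, and the dotted naming simply follows the `CurrentColumnName & "." & field` pattern used above.

```python
def expand_records(row, prefix=""):
    """Flatten nested dicts into one level, joining names with a dot."""
    flat = {}
    for name, value in row.items():
        full_name = prefix + name
        if isinstance(value, dict):
            # Record-like value: recurse so deeper levels are expanded too.
            flat.update(expand_records(value, full_name + "."))
        else:
            flat[full_name] = value
    return flat

# Hypothetical row with a record nested inside another record.
row = {"id": 1, "Document": {"title": "t", "meta": {"pages": 3}}}
flat_row = expand_records(row)
print(flat_row)
```

Note that this sketch also descends into fields produced by an earlier expansion, which matters for the "columns inside columns" case described in the question.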

Applying Rcpp on a dataframe

I'm new to C++ and exploring faster computation possibilities in R through the Rcpp package. The actual dataframe contains roughly 2 million rows, and processing it is quite slow.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z, which is a categorical variable with levels a, b and c. I had done a merge operation from another dataframe to bring the columns a, b and c into df with their specific numbers.
First step would be to match each row in z with the column names (a,b or c) and create a new column called 'type' and copy the corresponding number.
So the first row would read,
df$z[1] = "a"
df$type[1]= 303
Now it must match df$type with column names in another dataframe called 'cost' and create df$cost. The cost dataframe contains column names as numbers e.g. "103", "203" etc.
For our example, df$cost[1] = 6: it matches df$type[1] = 303 with cost$`303`[1] = 6.
Final Dataframe should look like this - Created a sample output
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))
A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost,melt(df)) # create a unique data frame
row.idx <- which(tmp$z==tmp$variable) # row index of matching values
col.val <- match(as.character(tmp$value[row.idx]), names(tmp) ) # find corresponding values in the column names
# now put all together
df2 <- data.frame('z'=unique(df$z),
'type' = tmp$value[row.idx],
'cost' = as.numeric(tmp[1,col.val]) )
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
see if it works
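The two lookup steps described in the question can be sketched row by row. This is a Python illustration of the logic only, not the Rcpp solution the question asks about; the list-of-dicts layout and the function name `add_type_and_cost` are assumptions, and the values are copied from the sample data frames above.

```python
# Rows mirror df: column z names which of a, b, c to read.
rows = [
    {"z": "a", "a": 303, "b": 203, "c": 903},
    {"z": "b", "a": 403, "b": 103, "c": 803},
    {"z": "c", "a": 503, "b": 803, "c": 703},
]
# Mirrors the one-row cost data frame, keyed by its column names.
cost = {"103": 4, "203": 5, "303": 6, "403": 7, "503": 8,
        "603": 9, "703": 10, "803": 11, "903": 12}

def add_type_and_cost(rows, cost):
    """Step 1: pick the value in the column named by z. Step 2: look up its cost."""
    out = []
    for row in rows:
        type_value = str(row[row["z"]])
        out.append({"z": row["z"], "type": type_value, "cost": cost[type_value]})
    return out

result = add_type_and_cost(rows, cost)
print(result)
```

Both steps are plain keyed lookups, so no reshaping or merging is needed for this part of the task.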

DAX: How do I write an IF statement to return a calculation for multiple (specific) values selected?

This is driving me nuts. Let's say we want to use a slicer which has two distinct values to choose from a dimension. There is A and B.
Let us also say that my Fact table is connected to this dimension, however it has the same dimension with more options.
My slicer now has A, B and (Blank). No biggie.
Let's now say I want to list out all of the possible calculation outcomes by selecting the slicer in a DAX formula, but in my visual I need all those outcomes to be listed in an IF() branched formula:
I can list out A:
IF(MAX(SlicerDim[Column]) = "A", CALCULATE([Calculation], SlicerDim[Column] = "A"))
I can list out B:
IF(MAX(SlicerDim[Column]) = "B", CALCULATE([Calculation], SlicerDim[Column] = "B"))
I can list out the (Blank) calculation too:
CALCULATE([Calculation], SlicerDim[Column] = Blank())
And I've managed to get a calculation out of it even when all of the slicer elements are on or off, using:
IF(NOT(ISFILTERED(SlicerDim[Column])), CALCULATE([Calculation], SlicerDim[Column] = "A" || SlicerDim[Column] = "B"))
Notice I need this IF() branch to actually return a calculation using A & B values, so now I have returns for when A or B or (Blank) or All or None are selected; BUT NOT when multiple values of A & B are selected!
How do I write out this IF() branch for it to return the same thing, but when both A & B are selected? Since there are only two real options in the slicer, I managed to get it to work with MIN() and MAX(), using their names or index numbers.
IF((MIN(SlicerDim[Column]) = "A" && MAX(SlicerDim[Column]) = "B") || NOT(ISFILTERED(SlicerDim[Column])), CALCULATE([Calculation], SlicerDim[Column] = "A" || SlicerDim[Column] = "B"))
BUT - I want a more understandable/robust/reusable formula, so that I could list out many selectable values from the slicer and have it return a calculation for specifically selected slicer values.
Please, help.
I've been searching high and low and there seems to be no easy way to fix this, short of scrapping the IF route and just using a damn slicer for this type of dilemma.
TL;DR:
How do I write an IF() branch calculation using DAX to get an outcome when All/None or non-blank or Specific slicer values are selected?
My best effort:
I am looking to improve the first IF() branch to not have to use MIN/MAX, because I would like to be able to reuse this type of formula if there were more than two real options in the slicer:
IF_branch =
IF((MIN(SlicerDim[Column]) = "A" && MAX(SlicerDim[Column]) = "B") || NOT(ISFILTERED(SlicerDim[Column])), CALCULATE([Calculation], SlicerDim[Column] = "A" || SlicerDim[Column] = "B"),
IF(MAX(SlicerDim[Column]) = "A", CALCULATE([Calculation], SlicerDim[Column] = "A"),
IF(MAX(SlicerDim[Column]) = "B", CALCULATE([Calculation], SlicerDim[Column] = "B"),
CALCULATE([Calculation], SlicerDim[Column] = BLANK()))))
I think what you are looking for is CONTAINS and VALUES.
VALUES will give you the distinct current selection in scope.
CONTAINS lets you check if a table contains any row with a set of values.
Formulas:
selected Scenarios = CONCATENATEX(VALUES(DimScenario[ScenarioName]);[ScenarioName];";")
Contains Forecast and Budget? =
IF(
CONTAINS(VALUES(DimScenario[ScenarioName]);[ScenarioName];"Forecast") &&
CONTAINS(VALUES(DimScenario[ScenarioName]);[ScenarioName];"Budget")
;"Yes"
;"No"
)
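The check these measures perform is a subset test: every required value must appear among the currently selected values. As a rough illustration of that logic only (Python, not DAX; the function name `contains_all` and the sample selection are hypothetical):

```python
def contains_all(selected, required):
    """True when every required value is among the selected values."""
    return set(required).issubset(selected)

# Hypothetical current selection, standing in for VALUES(DimScenario[ScenarioName]).
selected_scenarios = ["Actual", "Budget", "Forecast"]
answer = "Yes" if contains_all(selected_scenarios, ["Forecast", "Budget"]) else "No"
print(answer)
```

Because the test works on whole sets of values, it extends to any number of required values without relying on MIN/MAX ordering.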

Beginner rbind function

I cannot for the life of me understand the rbind function. I've tried using the examples on here, but I can't figure out what I am doing incorrectly. All I would like to do is add the data from my second data frame under the first.
Does rbind require the columns be the same name or...?
ParticipantA=c("A","B","C","D")
Score1A=c("21","20","21","21")
Score2A=c("32","40","32","31")
Score3A=c("47","50","43","46")
BlockA=data.frame(ParticipantA,Score1A,Score2A,Score3A)
BlockA$Major=c("Computer_Science","Computer_Science","Computer_Science","Computer_Science")
BlockA$Gender=c("Female","Female","Male","Male")
ParticipantB=c("E","F","G","H")
Score1B=c("28","28","21","22")
Score2B=c("30","36","37","32")
Score3B=c("41","49","49","46")
BlockB=data.frame(ParticipantB,Score1B,Score2B,Score3B)
BlockB$Major=c("Medical","Medical","Medical","Medical")
BlockB$Gender=c("Female","Female","Male","Male")
rbind requires that all columns have the same names and classes.
The problem is in the column names. rbind uses the column names to decide how to line up the rows. The columns can be in different orders; R uses the first data frame to determine the column order.
Your column names differ only by the "A" or "B" suffix, which is why rbind fails. Instead, you could add another column to each data frame holding the value "A" or "B"; that preserves the information without putting it in the column names. The additional column would also allow you to do more analyses in R, e.g. regression and other linear models.
Here is one way to handle your data:
Create a uniform set of column names that can be used for the data frames "BlockA" and "BlockB"
final_colnames <- c("Block", "Participant", "Score1", "Score2", "Score3")
Create a new vector to identify which block the participants belong to.
BlockA = c("A", "A", "A", "A")
Your previous data
ParticipantA = c("A", "B", "C", "D")
Score1A = c("21", "20", "21", "21")
Score2A = c("32", "40", "32", "31")
Score3A = c("47", "50", "43", "46")
The name "BlockA" is reused here for the new data frame, after the "BlockA" vector of "A" "A" "A" "A" has been added as its first column.
BlockA = data.frame(BlockA, ParticipantA, Score1A, Score2A, Score3A)
The new column names have to be added at this point, so that the number of names and the number of columns are equal.
colnames(BlockA) <- final_colnames
Now you can add the remaining columns
BlockA$Major = c("Computer_Science", "Computer_Science", "Computer_Science", "Computer_Science")
BlockA$Gender = c("Female", "Female", "Male", "Male")
BlockB is the same process
BlockB = c("B", "B", "B", "B") # the extra column
ParticipantB = c("E", "F", "G", "H")
Score1B = c("28", "28", "21", "22")
Score2B = c("30", "36", "37", "32")
Score3B = c("41", "49", "49", "46")
BlockB = data.frame(BlockB, ParticipantB, Score1B, Score2B, Score3B)
colnames(BlockB) <- final_colnames # renaming the columns
BlockB$Major = c("Medical", "Medical", "Medical", "Medical")
BlockB$Gender = c("Female", "Female", "Male", "Male")
Uniform column names mean that rbind will now work.
rbind(BlockA,BlockB)

Removing duplicates from the data

I already loaded 20 csv files with function:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
I combined all of those files into one:
all_data = do.call(rbind.fill, list_of_data)
The new table has a column called "Accession". After combining, many of the names (Accession) are repeated, and I would like to remove all of the duplicates.
Another problem is that some of those names are ALMOST the same. The only difference is that the name is followed by a dot and a number.
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one. So just ignore dot and a number after.
Tried this one:
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
Error in `$<-.data.frame`(`*tmp*`, "CleanedAccession", value = character(0)) :
You can use this command to both subset and rename the values:
subset(transform(all_data, Accession = sub("\\..*", "", Accession)),
!duplicated(Accession))
Accession
1 AT3G26450
2 AT5G44520
3 AT4G24770
4 AT2G37220
5 AT3G02520
6 AT5G05270
7 AT1G32060
8 AT3G52380
9 AT2G43910
10 AT2G19760
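The two steps in that command (strip everything from the first dot onward, then keep only the first occurrence of each name) can be illustrated in isolation. This is a small Python sketch of the logic, not a replacement for the R command; the function name `dedupe_accessions` is hypothetical and the input is the sample list from the question.

```python
def dedupe_accessions(accessions):
    """Strip the '.N' suffix and keep the first occurrence of each base name."""
    seen = set()
    kept = []
    for acc in accessions:
        base = acc.split(".", 1)[0]  # text before the first dot
        if base not in seen:
            seen.add(base)
            kept.append(base)
    return kept

accessions = [
    "AT3G26450.1", "AT5G44520.2", "AT4G24770.1", "AT2G37220.2",
    "AT3G02520.1", "AT5G05270.1", "AT1G32060.1", "AT3G52380.1",
    "AT2G43910.2", "AT2G19760.1", "AT3G26450.2",
]
cleaned = dedupe_accessions(accessions)
print(cleaned)
```

The last entry, AT3G26450.2, is dropped because its base name was already seen in the first row.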
What about
df <- data.frame( Accession = c("AT3G26450.1",
"AT5G44520.2",
"AT4G24770.1",
"AT2G37220.2",
"AT3G02520.1",
"AT5G05270.1",
"AT1G32060.1",
"AT3G52380.1",
"AT2G43910.2",
"AT2G19760.1",
"AT3G26450.2"))
df[!duplicated(unlist(lapply(strsplit(as.character(df$Accession),
".", fixed = TRUE), "[", 1))), ]