What happens if I pass a categorical value to an ML.NET prediction that was never seen during training?

For example, suppose I trained the model on these values:
Column1 = A, Column2 = B, Column3 = C, Label = 10
Column1 = D, Column2 = E, Column3 = F, Label = 20
Column1 = G, Column2 = H, Column3 = I, Label = 30
What if I then want to predict for this row?
Column1 = A, Column2 = B, Column3 = Z
What does the model do with the unseen value Z?

It depends on how you process the categorical data.
If, for example, you used the dictionary-based one-hot vectorizer:
new CategoricalOneHotVectorizer("Column1", "Column2", "Column3")
then the model will build a dictionary of terms per column:
Column1 -> [A, D, G]
Column2 -> [B, E, H]
Column3 -> [C, F, I]
If the value has not been seen (is not present in a dictionary), the CategoricalOneHotVectorizer assigns zero to all the 'one-hot' slots. So your example A B Z will turn into 1 0 0 1 0 0 0 0 0.
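The dictionary behaviour described above can be sketched in plain Python. This is not the ML.NET API; `fit_dictionary` and `one_hot` are hypothetical helper names, and the sketch assumes each column gets one slot per value seen in training, in first-seen order.

```python
def fit_dictionary(rows):
    """Collect the distinct values seen in each column, in first-seen order."""
    n_cols = len(rows[0])
    vocab = [[] for _ in range(n_cols)]
    for row in rows:
        for col, value in enumerate(row):
            if value not in vocab[col]:
                vocab[col].append(value)
    return vocab


def one_hot(row, vocab):
    """Encode a row; a value missing from a column's dictionary yields all zeros."""
    encoded = []
    for value, terms in zip(row, vocab):
        encoded.extend(1 if value == term else 0 for term in terms)
    return encoded


training = [("A", "B", "C"), ("D", "E", "F"), ("G", "H", "I")]
vocab = fit_dictionary(training)
encoded_unseen = one_hot(("A", "B", "Z"), vocab)
print(encoded_unseen)  # [1, 0, 0, 1, 0, 0, 0, 0, 0]
```

The last three slots stay at zero because Z is not in the Column3 dictionary.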
If, on the other hand, you use hash-based one-hot encoding:
new CategoricalHashOneHotVectorizer("Column1", "Column2", "Column3")
the incoming value Z will be hashed in the same way as the seen values C, F and I, and this will activate one of the 2^HashBits slots of the output column, based on the value of the hash.
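To illustrate the hash-based case, here is a rough Python sketch. It is only an analogy: the MD5-based `hash_slot` helper and the `HASH_BITS = 4` setting are assumptions made for the example, not the hash function or default that ML.NET uses.

```python
import hashlib

HASH_BITS = 4  # hypothetical setting; gives 2**4 = 16 slots per column


def hash_slot(value, hash_bits=HASH_BITS):
    """Map any string, seen or unseen, to a slot index in [0, 2**hash_bits)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % (2 ** hash_bits)


def hash_one_hot(value, hash_bits=HASH_BITS):
    """Return a vector of 2**hash_bits slots with exactly one slot set."""
    slots = [0] * (2 ** hash_bits)
    slots[hash_slot(value, hash_bits)] = 1
    return slots


encoded_z = hash_one_hot("Z")  # works even though Z was never seen in training
```

Because there is no dictionary, an unseen value always activates some slot, and it may share that slot with a value that was seen during training.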
The doc on the CategoricalOneHotVectorizer is not very clear on this one, but it still says:
"The Key value is the one-based index of the slot set in the Ind/Bag options. If the Key option is not found, it is assigned the value zero."

Related

subset of a list by first match of a part of the column's name

I have a list (L) of data frames containing several forms of an AB variable (such as AB_1, AB_1_1, ...). Can I get a subset of the list that keeps only the first column matching the AB form in each data frame?
The list (L) and the desired result (R) are as follows:
L1 = data.frame(AB_1 = c(1:4) , AB_1_1 = c(1:4) , C1 = c(1:4))
L2 = data.frame(AB_1_1 = c(1:4) , AB_2 = c(1:4), D = c(1:4) )
L=list(L1,L2)
R1 = data.frame(AB_1 = c(1:4) , C1 = c(1:4))
R2 = data.frame(AB_1_1 = c(1:4) , D = c(1:4))
R=list(R1,R2)
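It is not the best answer, but it is a solution:
First rename every column that starts with AB to AB, then drop the duplicated column names in each data frame of the list (L). Note that the surviving column ends up named AB rather than keeping its original name.
for (i in seq_along(L)) {
  # rename every column whose name starts with AB
  colnames(L[[i]])[grepl('^AB', colnames(L[[i]]))] <- 'AB'
  # keep only the first column for each name
  L[[i]] <- L[[i]][ , !duplicated(colnames(L[[i]]))]
}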
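The same selection logic, keeping only the first column whose name starts with AB, can be sketched in Python on the column names alone. `keep_first_match` is a hypothetical helper; unlike the R loop above, it keeps the original column name, which matches the desired result R.

```python
def keep_first_match(columns, prefix="AB"):
    """Keep every column, except matches of `prefix` after the first one."""
    kept = []
    seen_match = False
    for name in columns:
        if name.startswith(prefix):
            if seen_match:
                continue  # a later AB-style column: drop it
            seen_match = True
        kept.append(name)
    return kept


L = [["AB_1", "AB_1_1", "C1"], ["AB_1_1", "AB_2", "D"]]
R = [keep_first_match(cols) for cols in L]
print(R)  # [['AB_1', 'C1'], ['AB_1_1', 'D']]
```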

Power BI - Matching closest 3D points from two tables

I have two tables (Table 1 and Table 2) both containing thousands of three dimensional point coordinates (X, Y, Z), Table 2 also has an attribute column.
Table 1
X      Y       Z
6007   44268   1053
6020   44269   1051
Table 2
X      Y       Z      Attribute
6011   44310   1031   A
6049   44271   1112   B
I need to populate a calculated column in Table 1 with an attribute from Table 2 based on the minimum distance between points in 3D space. Basically, match the points in Table 1 to the closest point in Table 2 and then fetch the attribute from Table 2.
So far I have tried rounding X, Y and Z in both tables, then concatenating the rounded values into a separate XYZ column in each table. I then use this DAX:
CALCULATE(FIRSTNONBLANK(Table2[Attribute], 1), FILTER(ALL(Table2), Table2[XYZ] = Table1[XYZ]))
This has given me reasonable success depending on the degree of rounding applied to the coordinates.
Is there a better way to achieve this in Power BI?
This is similar to this post, except with a simpler distance function. See also this post.
Assuming you want the standard Euclidean Distance:
ClosestPointAttribute =
MINX (
    TOPN (
        1,
        Table2,
        ( Table2[X] - Table1[X] ) ^ 2 +
        ( Table2[Y] - Table1[Y] ) ^ 2 +
        ( Table2[Z] - Table1[Z] ) ^ 2,
        ASC
    ),
    Table2[Attribute]
)
Note: I've omitted the SQRT from the formula because we don't need the actual distance, just the ordering (and SQRT preserves order since it's a strictly increasing function). You can include it if you prefer.
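As a language-neutral illustration of the same idea, here is a small Python sketch that ranks candidates by squared distance and returns the attribute of the nearest one. The helper names are hypothetical, and the data are the sample rows from the question.

```python
def squared_distance(p, q):
    """Squared Euclidean distance; enough for ranking, so no square root."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_attribute(point, candidates):
    """Return the attribute of the candidate (x, y, z, attribute) nearest to point."""
    nearest = min(candidates, key=lambda c: squared_distance(point, c[:3]))
    return nearest[3]


table1 = [(6007, 44268, 1053), (6020, 44269, 1051)]
table2 = [(6011, 44310, 1031, "A"), (6049, 44271, 1112, "B")]

matches = [closest_attribute(p, table2) for p in table1]
print(matches)  # ['A', 'A']
```

Like the DAX column, this compares every point in Table 1 against every point in Table 2, so the work grows with the product of the two table sizes.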
A function in M Code:
(p1 as list, q1 as list)=>
let
    f = List.Generate(
        ()=> [x = Number.Power(p1{0}-q1{0},2), idx=0],
        each [idx]<List.Count(p1),
        each [x = Number.Power(p1{[idx]+1}-q1{[idx]+1},2), idx=[idx]+1],
        each [x]
    ),
    r = Number.Sqrt(List.Sum(f))
in
    r
Each list is a set of coordinates, and the function returns the distance between p1 and q1.
The above function (which I named fnDistance) can be incorporated into Power Query code as in this example:
let
    //Read in both tables and set data types
    Source2 = Excel.CurrentWorkbook(){[Name="Table_2"]}[Content],
    table2 = Table.TransformColumnTypes(Source2,{{"X", Int64.Type}, {"Y", Int64.Type}, {"Z", Int64.Type},{"Attribute", Text.Type}}),
    Source = Excel.CurrentWorkbook(){[Name="Table_1"]}[Content],
    table1 = Table.TransformColumnTypes(Source,{{"X", Int64.Type}, {"Y", Int64.Type}, {"Z", Int64.Type}}),
    //calculate distances from Table 1 coordinates to each of the Table 2 coordinates and store in a List
    custom = Table.AddColumn(table1,"Distances", each
        let
            t2 = Table.ToRecords(table2),
            X=[X],
            Y=[Y],
            Z=[Z],
            distances = List.Generate(()=>
                [d=fnDistance({X,Y,Z},{t2{0}[X],t2{0}[Y],t2{0}[Z]}),a=t2{0}[Attribute], idx=0],
                each [idx] < List.Count(t2),
                each [d=fnDistance({X,Y,Z},{t2{[idx]+1}[X],t2{[idx]+1}[Y],t2{[idx]+1}[Z]}),a=t2{[idx]+1}[Attribute], idx=[idx]+1],
                each {[d],[a]}),
            //determine set of coordinates with the minimum distance and return associated Attribute
            minDistance = List.Min(List.Alternate(List.Combine(distances),1,1,1)),
            attribute = List.Range(List.Combine(distances), List.PositionOf(List.Combine(distances),minDistance)+1,1){0}
        in
            attribute, Text.Type)
in
    custom

Applying Rcpp on a dataframe

I'm new to C++ and am exploring faster computation in R through the Rcpp package. The actual data frame contains over 2 million rows, and processing it is quite slow.
Existing Dataframes
Main Dataframe
df<-data.frame(z = c("a","b","c"), a = c(303,403,503), b = c(203,103,803), c = c(903,803,703))
Cost Dataframe
cost <- data.frame("103" = 4, "203" = 5, "303" = 6, "403" = 7, "503" = 8, "603" = 9, "703" = 10, "803" = 11, "903" = 12)
colnames(cost) <- c("103", "203", "303", "403", "503", "603", "703", "803", "903")
Steps
df contains z, a categorical variable with levels a, b and c. I merged in another data frame to bring the columns a, b and c into df with their specific numbers.
The first step is to match each value of z with the column of the same name (a, b or c), create a new column called 'type', and copy the corresponding number into it.
So the first row would read:
df$z[1] = "a"
df$type[1] = 303
Next, df$type must be matched against the column names of another data frame called 'cost' to create df$cost. The cost data frame has numbers as column names, e.g. "103", "203", etc.
For our example, df$cost[1] = 6, because df$type[1] = 303 matches cost$`303`[1] = 6.
The final data frame should look like this sample output:
df1 <- data.frame(z = c("a","b","c"), type = c("303", "103", "703"), cost = c(6,4,10))
A possible solution, not very elegant but does the job:
library(reshape2)
tmp <- cbind(cost, melt(df))           # create a unique data frame
row.idx <- which(tmp$z == tmp$variable) # row index of matching values
# find corresponding values in the column names
col.val <- match(as.character(tmp$value[row.idx]), names(tmp))
# now put all together
df2 <- data.frame('z'    = unique(df$z),
                  'type' = tmp$value[row.idx],
                  'cost' = as.numeric(tmp[1, col.val]))
the output:
> df2
z type cost
1 a 303 6
2 b 103 4
3 c 703 10
See if it works for you.
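The two lookups (z to type, then type to cost) can also be sketched in Python as plain dictionary lookups. This is neither Rcpp nor the reshape2 approach above, just an illustration of the logic; `lookup` is a hypothetical helper and the data are the sample values from the question.

```python
rows = [
    {"z": "a", "a": 303, "b": 203, "c": 903},
    {"z": "b", "a": 403, "b": 103, "c": 803},
    {"z": "c", "a": 503, "b": 803, "c": 703},
]
cost = {"103": 4, "203": 5, "303": 6, "403": 7, "503": 8,
        "603": 9, "703": 10, "803": 11, "903": 12}


def lookup(row, cost):
    """Pick the column named by z, then look that number up in the cost table."""
    type_code = str(row[row["z"]])
    return {"z": row["z"], "type": type_code, "cost": cost[type_code]}


result = [lookup(row, cost) for row in rows]
for r in result:
    print(r["z"], r["type"], r["cost"])
# a 303 6
# b 103 4
# c 703 10
```

Each row costs two constant-time lookups, so the work grows linearly with the number of rows.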

Conditional calculation based on another column

I have a cross-reference table and another table with the list of "Items".
I connect "PKG" to "Item", since "PKG" has distinct values.
Example:
**Cross table**      **Item table**
Bulk   PKG           Item   Value
A      D             A      2
A      E             B      1
B      F             C      4
C      G             D      5
                     E      8
                     F      3
                     G      1
After connecting the two tables above by PKG and Item, I get the following result:
Item   Value   Bulk   PKG
A      2
B      1
C      4
D      5       A      D
E      8       A      E
F      3       B      F
G      1       C      G
As you can see, nothing shows up for the first three items, since the tables are connected by PKG and those items are "Bulk" values.
I am trying to create a new column that uses the cross-reference table.
I want to create the following, with a new column:
Item   Value   Bulk   PKG   NEW COLUMN
A      2                    5
B      1                    3
C      4                    1
D      5       A      D     5.75
E      8       A      E     9.2
F      3       B      F     3.45
G      1       C      G     1.15
The new column is what I am trying to create.
I want the original values to show up for Bulk as they appear for PKG. I then want the PKG items to be 15% higher than the original value.
How can I calculate this based on the setup?
Just write a conditional custom column in the query editor:
New Column = if [Bulk] = null then [Value] else 1.15 * [Value]
You can also do this as a DAX calculated column:
New Column = IF( ISBLANK( Table1[Bulk] ), Table1[Value], 1.15 * Table1[Value] )
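For readers who want to check the arithmetic outside Power BI, the same conditional can be sketched in Python. It mirrors the formula in this answer, with `None` standing in for a blank Bulk cell; `new_column` is a hypothetical helper, and the rounding to two decimals is only there to tidy floating-point output.

```python
def new_column(value, bulk):
    """Keep the value when Bulk is blank, otherwise raise it by 15%."""
    if bulk is None:
        return value
    return round(1.15 * value, 2)


# (Item, Value, Bulk) rows from the joined table in the question
joined = [
    ("A", 2, None), ("B", 1, None), ("C", 4, None),
    ("D", 5, "A"), ("E", 8, "A"), ("F", 3, "B"), ("G", 1, "C"),
]

results = {item: new_column(value, bulk) for item, value, bulk in joined}
print(results["D"], results["E"], results["F"], results["G"])  # 5.75 9.2 3.45 1.15
```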

In R, how can I insert a TRUE / FALSE column if strings in columns ARE / ARE NOT in alphabetical order?

Sample data:
df <- data.frame(noun1 = c("cat","dog"), noun2 = c("apple", "tree"))
noun1 noun2
1 cat apple
2 dog tree
How can I make a new column df$alpha that would read FALSE in row 1 and TRUE in row 2?
Thank you!
I think you can just apply is.unsorted() to each row, although you have to unlist it first (probably).
df <- data.frame(noun1 = c("cat","dog"), noun2 = c("apple", "tree"))
df$alpha <- apply(df,1,function(x) !is.unsorted(unlist(x)))
I found is.unsorted() via apropos("sort").
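The row-wise check can be sketched in Python as well, which makes the logic explicit: a row is "alphabetical" when no value is greater than the one after it. `is_sorted` is a hypothetical helper, and plain Python string comparison is assumed here, which may differ from R's locale-aware ordering for mixed case or accented text.

```python
def is_sorted(values):
    """True when each value is <= the next one (non-decreasing order)."""
    return all(a <= b for a, b in zip(values, values[1:]))


rows = [("cat", "apple"), ("dog", "tree")]
alpha = [is_sorted(row) for row in rows]
print(alpha)  # [False, True]
```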