I have a DataFrame X which has two categorical features and 41 numerical features, so X has a total of 43 features.
Now, I would like to convert the categorical features into numerical levels so they can be used in a RandomForest classifier.
I have done the following, where 0 and 1 indicate the locations of the categorical features:
import pandas as pd
X = pd.read_csv("train.csv")
F1 = pd.get_dummies(X.iloc[:, 0])
F2 = pd.get_dummies(X.iloc[:, 1])
Then, I concatenate these two DataFrames with the rest of X:
Xnew = pd.concat([F1, F2, X.iloc[:, 2:]], axis=1)
Now, Xnew has 63 features (F1 has 18 and F2 has 4 features; the remaining 41 are from X).
Is this correct? Is there a better way of doing the same thing? Do I need to drop the first column from F1 and F2 to avoid collinearity?
Since F1 has 18 levels (not features) and F2 has 4, your result looks correct.
To avoid collinearity, you had better drop one column from each of F1 and F2, not necessarily the first column. Typically you drop the column of the most common level.
Why the one with the most common level? Think about feature importance: if you drop a column, that level has no chance to get its importance estimated. The level you dropped acts as your "base level"; only deviations from the base level can be marked as important or not.
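As a hedged sketch of this suggestion (the tiny frame and column names below are made up for illustration, not the asker's data), dropping the dummy column of the most common level from each block could look like this:

```python
import pandas as pd

# Toy stand-in for the real data: two categorical columns plus one numeric
X = pd.DataFrame({"cat1": ["a", "b", "a", "c"],
                  "cat2": ["x", "x", "y", "x"],
                  "num1": [1.0, 2.0, 3.0, 4.0]})

def dummies_drop_most_common(s):
    """One-hot encode a Series, dropping the most frequent level as the base."""
    d = pd.get_dummies(s, prefix=s.name)
    base = f"{s.name}_{s.mode()[0]}"   # dummy column of the most common level
    return d.drop(columns=base)

F1 = dummies_drop_most_common(X["cat1"])
F2 = dummies_drop_most_common(X["cat2"])
# axis=1 glues the frames side by side, column-wise
Xnew = pd.concat([F1, F2, X.iloc[:, 2:]], axis=1)
```

Here `cat1`'s base level "a" and `cat2`'s base level "x" disappear, and the remaining dummies measure deviation from those bases.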
The Meijer G-function is a neat instrument for treating multiplication of random variables, and for work I am conducting on this subject I am trying to use it in SymPy (since it is not present in Sage or other free programs).
It looks like the "meijerg" packages available in SymPy provide a wide set of instruments.
I succeeded in importing it together with the relevant package for integrals ("sympy.integrals.meijerint") and could start to do some basic manipulations, like plotting, inverting the argument (_flip_g), computing values, etc.
However, notwithstanding my best efforts, I cannot get SymPy to perform some of the most basic simplifications, for example the "absorption" of powers of the argument.
So after defining
from sympy import symbols, meijerg
from sympy.integrals.meijerint import _int0oo

b1,b2,b3,b4,b5,d1,d2,d3,d4,d5 = symbols('b1 b2 b3 b4 b5 d1 d2 d3 d4 d5')
a1,a2,a3,a4,a5,c1,c2,c3,c4,c5 = symbols('a1 a2 a3 a4 a5 c1 c2 c3 c4 c5')
y,w,z = symbols('y w z', positive=True)

def G1(x):
    return meijerg([[a1,a2,a3],[a4,a5]], [[b1,b2],[b3,b4]], x)

def G2(x):
    return meijerg([[c1,c2],[c3]], [[d1,d2,d3],[d4]], x)
then asking for the integral
Ris = _int0oo(G1(y*x), G2(w*x), x)
Ris
I get (on Jupyter) a result in which there is no way to "absorb" the y in the denominator.
Instead, if I input
integrate(G1(y*x)*G2(w*x),(x,0,oo))
I get a result whose first line is in fact what I would like to get.
So my question is why the absorption simplification is not attainable, or how it can be attained with any of the instruments in the package (_rewrite1, _guess_expansion, etc.).
---- addendum ---
I realize from the comments that I fell into a newbie trap: thanks indeed to Davide for pointing it out.
However, apart from and before integrating, some basic algebraic manoeuvres on G, like inverting the argument (the "hidden" tool _flip_g), absorbing a power of the argument, rewriting a function as G, and the like, would be very useful.
Is there any way to properly access them? If not, it would remain as a kind request to the developers to make them usable. Thanks.
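For what it is worth, a minimal runnable sketch showing that the public integrate() interface does absorb the prefactor of the argument. The G-function used here is my own illustration (the simplest one, for which G^{1,0}_{0,1}(z | ; 0) = exp(-z)), not the poster's G1/G2:

```python
from sympy import symbols, meijerg, integrate, hyperexpand, exp, simplify, oo

x, y = symbols('x y', positive=True)

# Simplest Meijer G-function: G^{1,0}_{0,1}(z | ; 0) = exp(-z)
G = meijerg([[], []], [[0], []], y*x)

# hyperexpand rewrites the G-function in terms of elementary functions
assert simplify(hyperexpand(G) - exp(-x*y)) == 0

# integrate() routes through the meijerint machinery and absorbs the
# prefactor y of the argument: integral of exp(-y*x) over (0, oo) is 1/y
Ris = integrate(G, (x, 0, oo))
assert simplify(Ris - 1/y) == 0
```

The low-level helpers like _int0oo skip this argument normalization, which is why calling them directly leaves the prefactor unabsorbed.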
I believe I understand when Concatenate needs to be called, on what data and why. What I'm trying to understand is what physically happens to the input columns' data when Concatenate is called.
Is this some kind of hash function that hashes all the input data from the columns and generates a result?
In other words, I would like to know whether it is technically possible to restore the original values from the value generated by Concatenate.
Does the order of the data columns passed into Concatenate affect the resulting model, and in what way?
Why am I asking all that? I'm trying to understand which input parameters affect the quality of the produced model, and in what way. I have many input columns of data. They are all rather important, and the relations between their values matter. If Concatenate does something simple and loses the relations between values, I would try one approach to improve the quality of the model. If it is rather complex and keeps the details of the values, I would use other approaches.
In ML.NET, Concatenate takes individual features (of the same type) and creates a feature vector.
In pattern recognition and machine learning, a feature vector is an n-dimensional vector of numerical features that represent some object. Many algorithms in machine learning require a numerical representation of objects, since such representations facilitate processing and statistical analysis. When representing images, the feature values might correspond to the pixels of an image, while when representing texts the features might be the frequencies of occurrence of textual terms. Feature vectors are equivalent to the vectors of explanatory variables used in statistical procedures such as linear regression.
To my understanding there's no hashing involved. Conceptually you can think of it like the String.Join method, where you're taking individual elements and join them into one. In this case, that single component is a feature vector that as a whole represents the underlying data as an array of type T where T is the data type of the individual columns.
As a result, you can always access the individual components and order should not matter.
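As a language-neutral analogy (plain numpy here, not ML.NET code), concatenation into a feature vector is just placing columns side by side: no hashing, nothing lost, and a different column order only permutes the indices within each vector:

```python
import numpy as np

num_rooms = np.array([3.0, 2.0, 6.0])
num_baths = np.array([2.0, 1.0, 7.0])
sq_ft = np.array([1200.0, 800.0, 5000.0])

# "Concatenate" = stack the columns side by side into one feature matrix
features = np.column_stack([num_rooms, num_baths, sq_ft])

# Every original value stays recoverable by index
assert features[0].tolist() == [3.0, 2.0, 1200.0]

# A different column order only permutes the positions within each vector
reordered = np.column_stack([sq_ft, num_rooms, num_baths])
assert reordered[1].tolist() == [800.0, 2.0, 1.0]
```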
Here's an example using F# that takes data, creates a feature vector using the concatenate transform, and accesses the individual components:
#r "nuget:Microsoft.ML"
open Microsoft.ML
open Microsoft.ML.Data
// Raw data
let housingData =
    seq {
        {| NumRooms = 3f; NumBaths = 2f; SqFt = 1200f |}
        {| NumRooms = 2f; NumBaths = 1f; SqFt = 800f |}
        {| NumRooms = 6f; NumBaths = 7f; SqFt = 5000f |}
    }
// Initialize MLContext
let ctx = new MLContext()
// Load data into IDataView
let dataView = ctx.Data.LoadFromEnumerable(housingData)
// Get individual column names (NumBaths, NumRooms, SqFt;
// anonymous record fields are exposed in alphabetical order)
let columnNames =
    dataView.Schema
    |> Seq.map (fun col -> col.Name)
    |> Array.ofSeq
// Create pipeline with concatenate transform
let pipeline = ctx.Transforms.Concatenate("Features", columnNames)
// Fit data to pipeline and apply transform
let transformedData = pipeline.Fit(dataView).Transform(dataView)
// Get "Feature" column containing the result of applying Concatenate transform
let features = transformedData.GetColumn<float32 array>("Features")
// Deconstruct the feature vector and print the individual features
// (vector order follows the schema: NumBaths, NumRooms, SqFt)
printfn "Baths | Rooms | SqFt"
for [|baths; rooms; sqft|] in features do
    printfn $"{baths} | {rooms} | {sqft}"
The result output to the console is:
Baths | Rooms | SqFt
2 | 3 | 1200
1 | 2 | 800
7 | 6 | 5000
If you're looking to understand the impact individual features have on your model, I'd suggest looking at Permutation Feature Importance (PFI) and Feature Contribution Calculation.
I'm trying to reproduce that lovely website that helps you calculate the cost of something you bought in my personal Google Sheet, so it's easier for me to use.
I'm seeking help here since I don't really know how to adapt the math when you change the value of year/month/day.
As you may see, it's able to calculate the cost by year, but when you change the value to month, for example, I don't know how to make it adjust the results.
I've tried =SUMIF and =IF, but I can't seem to find a clear way to do it.
here is the doc
Thanks a lot!
I think what you are looking for is the SWITCH function:
You can in the cell D6 use the following formula:
=SWITCH(F2; I2; E1/E2; I3; E1*12/E2; I4; E1*52/E2; I5; E1*365/E2)
The logic is:
check the cell F2 (where you have the dropdown)
if the value of F2 equals I2 (Year), just divide the cost E1 by the number in E2
if the value of F2 equals I3 (Month), calculate E1*12 and divide it by E2 (i.e. (E1/E2)*12)
if the value of F2 equals I4 (Week), calculate E1*52/E2 (same as above but with 52 weeks)
if the value of F2 equals I5 (Day), calculate E1*365/E2 (same as above but with 365 days)
And so on for the other cells; just change the formulas between day, week, month and year.
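A small Python mirror of the SWITCH logic above (the cell meanings are my assumption from the formula: E1 is the cost, E2 is the divisor in cell E2, and the factor depends on the dropdown value):

```python
# Hypothetical mirror of =SWITCH(F2; I2; E1/E2; I3; E1*12/E2; ...)
FACTORS = {"Year": 1, "Month": 12, "Week": 52, "Day": 365}

def per_period_cost(e1, e2, period):
    """Apply the same formula the sheet uses for the chosen period."""
    return e1 * FACTORS[period] / e2

assert per_period_cost(120, 2, "Year") == 60
assert per_period_cost(120, 2, "Month") == 720
```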
I have 2 columns of data, AX and AY. Each cell holds a categorical value, namely "Strong", "Good", "Moderate" or "Weak", arranged in decreasing order of strength of evidence (i.e. weak evidence to support... strong evidence to support, etc.).
I want to create a new column that chooses the lower strength-of-evidence category. For example, if AX2 = Weak and AY2 = Strong, then AZ2 = Weak (where AZ is the new column I want the new values to fall into). Similarly, if AX3 = Good and AY3 = Moderate, then AZ3 = "Moderate".
I would like some way to set a hierarchy, similar to spatial-thinking concepts in GIS where the minimal value is selected, so that the value in AZ is the minimum (i.e. of lower strength of evidence) of AX and AY, choosing one of the 2 cell values.
In Excel, I tried doing this using IF, AND and OR statements, e.g. if AX = Weak and AY is any of the four, then AZ = Weak. I was thinking of repeating this for the other scenarios, e.g. if AX = Moderate and AY is any of the 3 (Moderate, Good, Strong), then AZ = Moderate.
(My code)
=IF(AND(AX4="Weak",OR(AY4="Weak",AY4="Moderate",AY4="Good",AY4="Strong")),"Weak"," ")
Then I realized that while I am currently fixing the value of AX and using OR functions for AY, I would have to repeat the same thing in the other direction (fix a value for AY and use OR functions for AX) to avoid excluding certain scenarios.
My current code only works for creating AZ values equal to Weak, and when I attempted to chain multiple OR functions at the start to define the different scenarios, I received an error message telling me there were too many arguments.
I have come to the conclusion that the way I am attempting to perform this task is very inefficient, and would greatly value any and all advice.
Yes, it is often inefficient to apply Excel, which is designed for numbers, to text. However, it is possible (with 'conversion' on the fly and back again):
=CHOOSE(MAX(MATCH(AX2,{"Strong","Good","Moderate","Weak"},0),MATCH(AY2,{"Strong","Good","Moderate","Weak"},0)),"Strong","Good","Moderate","Weak")
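The CHOOSE/MAX/MATCH trick converts each label to its rank in the array, takes the larger (weaker) rank, and converts back to a label. A Python sketch of the same idea:

```python
# Rank labels from strongest (index 0) to weakest (index 3),
# mirroring the {"Strong","Good","Moderate","Weak"} array in the formula
ORDER = ["Strong", "Good", "Moderate", "Weak"]

def weaker(a, b):
    """Return whichever of a, b carries the lower strength of evidence."""
    return ORDER[max(ORDER.index(a), ORDER.index(b))]

assert weaker("Weak", "Strong") == "Weak"
assert weaker("Good", "Moderate") == "Moderate"
```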
I have a data set with tens of millions of rows. Several columns on this data represent categorical features. Each level of these features is represented by an alpha-numeric string like "b009d929".
C1 C2 C3 C4 C5 C6 C7
68fd1e64 80e26c9b fb936136 7b4723c4 25c83c98 7e0ccccf de7995b8 ...
68fd1e64 f0cf0024 6f67f7e5 41274cd7 25c83c98 fe6b92e5 922afcc0
I'd like to be able to use Python to map each distinct level to a number to save memory, so that feature C1's levels would be replaced by numbers from 1 to C1_n, C2's levels by numbers from 1 to C2_n, and so on.
Each feature has a different number of levels, ranging from under 10 to 10k+.
I tried dictionaries with Pandas' .replace() but it gets extremely slow.
What is a fast way to approach this problem?
I figured out that the categorical feature values were hashed onto 32 bits. So I ended up reading the file in chunks and applying this simple function:
int(categorical_feature_value, 16)
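Besides the hex trick, pandas' factorize is a fast, vectorized way to map each column's distinct levels to small integers without per-level dictionaries (the tiny frame below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"C1": ["68fd1e64", "68fd1e64", "80e26c9b"],
                   "C2": ["80e26c9b", "f0cf0024", "80e26c9b"]})

# pd.factorize assigns codes 0..n-1 in order of first appearance, in one pass
for col in df.columns:
    codes, uniques = pd.factorize(df[col])
    df[col] = codes + 1   # shift to 1..C_n as the question asks

assert df["C1"].tolist() == [1, 1, 2]
assert df["C2"].tolist() == [1, 2, 1]
```

Unlike int(value, 16), this yields dense codes per column, which is what the question originally asked for and keeps memory small even with 10k+ levels.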