I have a dataset consisting of categorical and numerical data with 124 features. In order to reduce its dimensionality I want to remove irrelevant features. However, to run the dataset through a feature selection algorithm I one-hot encoded it with get_dummies, which increased the number of features to 391.
In[16]:
X_train.columns
Out[16]:
Index([u'port_7', u'port_9', u'port_13', u'port_17', u'port_19', u'port_21',
...
u'os_cpes.1_2', u'os_cpes.1_1'], dtype='object', length=391)
With the resulting data I can run recursive feature elimination with cross-validation, as per the scikit-learn example:
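Roughly, the setup looks like the following (a minimal sketch along the lines of the scikit-learn RFECV example; X_train, an encoded target y_train, and the fold count are assumptions here):
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Linear SVC as the estimator, removing one feature per step and scoring
# each candidate feature count by cross-validated accuracy.
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2), scoring='accuracy')
rfecv.fit(X_train, y_train)
print("Optimal number of features: {}".format(rfecv.n_features_))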
Which produces:
[Graph: cross-validated score vs. number of features selected]
Given that the optimal number of features identified was 8, how do I identify the feature names? I am assuming that I can extract them into a new DataFrame for use in a classification algorithm?
[EDIT]
I have achieved this as follows, with help from this post:
import numpy as np

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]

feature_index = []
features = []
column_index(X_dev_train, X_dev_train.columns.values)  # (return value unused)

# Collect the positions of the columns RFECV kept...
for num, i in enumerate(rfecv.get_support(), start=0):
    if i:
        feature_index.append(str(num))

# ...and map those positions back to column names.
for num, i in enumerate(X_dev_train.columns.values, start=0):
    if str(num) in feature_index:
        features.append(X_dev_train.columns.values[num])

print("Features Selected: {}\n".format(len(feature_index)))
print("Features Indexes: \n{}\n".format(feature_index))
print("Feature Names: \n{}".format(features))
which produces:
Features Selected: 8
Features Indexes:
['5', '6', '20', '26', '27', '28', '67', '98']
Feature Names:
['port_21', 'port_22', 'port_199', 'port_512', 'port_513', 'port_514', 'port_3306', 'port_32768']
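As an aside, the index bookkeeping above can be collapsed into one step, since rfecv.get_support() returns a boolean mask over the input columns that can index the column names directly:
# Boolean mask over the input columns -> selected column names
features = list(X_dev_train.columns[rfecv.get_support()])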
Given that one-hot encoding introduces multicollinearity, I don't think the target column selection is ideal, because the features it has chosen are non-encoded continuous data features. I have tried re-adding the target column unencoded, but RFE throws the following error because the data is categorical:
ValueError: could not convert string to float: Wireless Access Point
Do I need to group multiple one-hot encoded feature columns to act as the target?
[EDIT 2]
If I simply LabelEncode the target column, I can use this target as 'y' (see the example again). However, the output determines only a single feature (the target column itself) as optimal. I think this might be because of the one-hot encoding; should I be looking at producing a dense array, and if so, can it be run against RFE?
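For reference, the label-encoding step I mean is just the following (the target column name here is hypothetical):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(df['device_type'])  # hypothetical name for the target column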
Thanks,
Adam
You can do this:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)  # keep the 5 best features
rfe = rfe.fit(X, y)
print(rfe.support_)     # boolean mask of the selected features
print(rfe.ranking_)     # rank 1 = selected
f = rfe.get_support(1)  # indices of the most important features
X = df[df.columns[f]]   # final features
Then you can use X as the input to your neural network or any other algorithm.
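The same lookup works with RFECV from the question, since it exposes the same support_ and get_support() interface; the only difference is that the number of features kept is chosen by cross-validation rather than fixed at 5.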
Answering my own question, I figured out the issue was related to the way I had one-hot encoded the data. Initially, I ran one-hot encoding against all categorical columns as follows:
ohe_df = pd.get_dummies(df[df.columns]) # One-hot encode all columns
This introduced a large number of additional features. Taking a different approach, with some help from here, I have modified the encoding to encode multiple columns on a per-column/feature basis as follows:
cf_df = df.select_dtypes(include=[object])  # Get categorical features
nf_df = df.select_dtypes(exclude=[object])  # Get numerical features
ohe_df = nf_df.copy()

# Encode each categorical column into a single column of 0/1 lists
for feature in cf_df:
    ohe_df[feature] = cf_df[feature].str.get_dummies().values.tolist()
Producing:
ohe_df.head(2) # Only showing a subset of the data
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
| | os_name | os_family | os_type | os_vendor | os_cpes.0 |
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
| 0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 1, 0, 0, 0] | [1, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ... |
| 1 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 1, 0] | [0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
+---+---------------------------------------------------+-----------------+-----------------+-----------------------------------+---------------------------------------------------+
Unfortunately, although this encoding was what I was searching for, it didn't execute against RFECV: each cell now holds a list rather than a scalar, so scikit-learn cannot coerce the frame into a numeric 2-D array. Next I thought perhaps I could take a slice of all the new features and pass them in as the target, but this resulted in an error. Finally, I realised I would have to iterate through all target values and take the top outputs from each. The code ended up looking something like this:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

for num, feature in enumerate(features, start=0):
    X = X_dev_train
    y = X_dev_train[feature]
    # Create the RFE object and compute a cross-validated score.
    svc = SVC(kernel="linear")
    # The "accuracy" scoring is proportional to the number of correct
    # classifications; step is the number of features to remove per iteration.
    rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(kfold),
                  scoring='accuracy')
    try:
        rfecv.fit(X, y)
        print("Number of observations in each fold: {}".format(len(X) / kfold))
        print("Optimal number of features: {}".format(rfecv.n_features_))
        g_scores = rfecv.grid_scores_
        indices = np.argsort(g_scores)[::-1]
        print('Printing RFECV results:')
        for num2, f in enumerate(range(X.shape[1]), start=0):
            if g_scores[indices[f]] > 0.80 and num2 < 10:
                print("{}. Number of features: {} Grid_Score: {:0.3f}".format(
                    f + 1, indices[f] + 1, g_scores[indices[f]]))
        print("\nTop features sorted by rank:")
        results = sorted(zip(map(lambda x: round(x, 4), rfecv.ranking_),
                             X.columns.values))
        for num3, i in enumerate(results, start=0):
            if num3 < 10:
                print(i)
        # Plot number of features vs. cross-validation scores
        plt.rc("figure", figsize=(8, 5))
        plt.figure()
        plt.xlabel("Number of features selected")
        plt.ylabel("CV score (fraction of correct classifications)")
        plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
        plt.show()
    except ValueError:
        pass
I'm sure this could be cleaner, maybe even plotted in one graph, but it works for me.
Cheers,
Related
I want to classify images with a pretrained ResNet50 model in PyTorch, and I am stuck on getting the data from my dataset into the model.
As far as I understand, the training images need to be passed to the model as a tensor of shape (N, 4, 512, 512), where N is the number of images, 4 is the number of channels, and 512 is the width and height of each picture, and the targets need to be passed as an array. What I have right now is a pandas DataFrame with columns "Image" and "Label", where each entry in the "Image" column is a nested list of shape (512, 512, 4).
I tried writing the data to an array, but it takes too long and uses a lot of memory. Is there some other way to do this? In short, my question is: how can I get the data into the model?
This is part of my DataFrame:
+--------+---------------------------------------------------+-------+
| Number | Image                                             | Label |
+--------+---------------------------------------------------+-------+
| 0      | [[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0...| 0     |
| 1      | [[[4, 3, 0, 18], [82, 0, 0, 27], [11, 4, 0, 20...| 14    |
| 2      | [[[2, 2, 0, 0], [1, 5, 0, 1], [0, 5, 0, 0], [2...| 14    |
| 3      | [[[7, 1, 0, 24], [31, 0, 0, 14], [23, 3, 0, 13...| 3     |
| ...    | ...                                               | ...   |
+--------+---------------------------------------------------+-------+
I tried to do it in the following way:
import torch

x_train = []
y_train = []
for data in range(N):
    x_train.append(train_df['Image'].iloc[data])
    y_train.append(train_df['Label'].iloc[data])

x_train = torch.tensor(x_train)
y_train = torch.tensor(y_train)
y_train = y_train.view(-1)
x_train = x_train.permute(0, 3, 1, 2)  # (N, 512, 512, 4) -> (N, 4, 512, 512)
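For what it's worth, a common way to avoid the slow element-by-element conversion is to stack the images into one NumPy array first and build the tensor from that in a single step (a sketch, assuming each "Image" entry is a (512, 512, 4) array-like):
import numpy as np
import torch

# Stack all images into a single (N, 512, 512, 4) array, then convert once;
# torch.from_numpy shares the array's memory instead of copying per element.
x_train = torch.from_numpy(np.stack(train_df['Image'].values)).float()
x_train = x_train.permute(0, 3, 1, 2)  # (N, 4, 512, 512)

y_train = torch.tensor(train_df['Label'].values).long().view(-1)
If memory is still the bottleneck, a custom torch.utils.data.Dataset that converts one image at a time in __getitem__ avoids materialising the whole tensor at once.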
I would like to set all the values in the first list inside the list of lists named "child_Before" below to zero. The piece of code I wrote to accomplish this task is also shown below, after the list:
child_Before = [[9, 12, 7, 3, 13, 14, 10, 5, 4, 11, 8, 6, 2],
[1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1],
[[1, 0], [1, 1]]]
for elem in range(len(child_Before[0])):
    child_Before[0][elem] = 0
Below is the expected result:
child_After = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1],
[[1, 0], [1, 1]]]
However, I think there should be a more nimble way to accomplish this exercise. Hence, I welcome your help. Thank you in advance.
Just to add a creative answer:
import numpy as np
child_Before[0] = (np.array(child_Before[0]) & 0).tolist()
This is bad practice, though, since it uses bitwise operations in a scenario where they are not intuitive, and between the conversion and the & it probably makes two passes over the list. On the bright side, the & that produces the zeros is a single vectorized operation.
Just create a list of zeros with the same length as the original list:
# Answer to this question - make first list in the list to be all 0
child_Before[0] = [0] * len(child_Before[0])
As for your answer, I can extend it to make all the elements in all the lists of this list zero:
# Make all elements 0
for child in range(len(child_Before)):
    child_Before[child] = [0] * len(child_Before[child])
Use a list comprehension:
# n == 0 selects the first sub-list; every other sub-list passes through unchanged
child_after = [[i if n != 0 else 0 for i in j] for n, j in enumerate(child_Before)]
How can I convert the current datetime to epoch/Unix time in seconds?
The functions listed below do not offer this as far as I can see:
DateTime functions
In C#, it would be done this way:
var epoch = (DateTime.UtcNow - new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc)).TotalSeconds;
You can do something quite similar in M:
epoch = Duration.TotalSeconds(DateTimeZone.UtcNow() - #datetimezone(1970, 1, 1, 0, 0, 0, 0, 0))
Hi, I have a dictionary like the one below:
b = {'tat': 0, 'del': 4, 'galadriel': 0, 'sire': 0, 'caulimovirus': 4, 'retrofit': 0, 'tork': 0, 'caulimoviridae_dom2': 0, 'reina': 4, 'oryco': 2, 'cavemovirus': 1, 'soymovrius': 0, 'badnavirus': 0, 'crm': 0, 'athila': 0}
I want to find all keys with the maximum value, as a list. However,
max(b, key=b.get)
only gives the first such key, 'del'.
How should I find all the keys with the maximum value, like the below?
new_list = ['del', 'caulimovirus', 'reina']
maxv = max(b.values())  # the maximum value in the dict
new_list = [k for k, v in b.items() if v == maxv]  # every key that maps to it
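For the dictionary above, this gives new_list == ['del', 'caulimovirus', 'reina'], matching the expected output.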
I have a 2-dimensional array of ones and zeros called M where the g rows represent groups and the a columns represent articles. M maps groups and articles. If a given article "art" belongs to group "gr" then we have M[gr,art]=1; if not we have M[gr,art]=0.
Now, I would like to convert M into a square a x a matrix of ones and zeros (call it N) where, if an article "art1" is in the same group as article "art2", we have N[art1,art2]=1, and N[art1,art2]=0 otherwise. N is clearly symmetric, with 1's on the diagonal.
How do I construct N based on M?
Many thanks for your suggestions - and sorry if this is trivial (still new to python...)!
So you have a boolean matrix M like the following:
>>> M
array([[1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0]])
>>> ngroups, narticles = M.shape
and what you want is a matrix of shape (narticles, narticles) that represents co-occurrence. Since articles are the columns of M, that's the product of M's transpose with M (note that np.dot(M, M.T) would give group overlap instead):
>>> np.dot(M.T, M)
array([[2, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 2, 0, 0, 0],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1]])
... except that you don't want counts, so set entries > 0 to 1.
>>> N = np.dot(M.T, M)
>>> N[N > 0] = 1
>>> N
array([[1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 1]])