I have a set of images split into train and test sets, and I'm trying to detect features with SIFT on the train set.
The problem is that with my code I'm getting:
TypeError: image is not a numpy array, neither a scalar
Here's my code:
import glob
from cv2 import SIFT
import numpy as np
#creating a list of images
images = []
for infile in glob.glob('path'):
    images.append(infile)
np.random.shuffle(images)
my_set = images
#splitting my set in test and train parts
train = my_set[:120]
test = my_set[120:]
#get descriptors of train part
for image in train:
    SIFT().detect(image)
I've tried to change the variables train and test like this:
train = np.array(my_set[:120])
but I get the same error.
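For context, a minimal sketch (not a verified fix) of the likely issue, assuming the OpenCV 2.4-style cv2.SIFT API used above: glob.glob returns file path strings, so the train list holds paths rather than image arrays, and detect needs an actual numpy array. Reading each file first would look something like this:
import glob
import cv2
from cv2 import SIFT

train_keypoints = []
for path in glob.glob('path'):        # same placeholder pattern as above
    img = cv2.imread(path, 0)         # load the file as a grayscale numpy array
    if img is None:                   # imread returns None for unreadable files
        continue
    keypoints = SIFT().detect(img)    # detect now receives an array, not a string
    train_keypoints.append(keypoints)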
I have this image shown below
Here I am trying to find the threshold that separates the bimodal classes using the Otsu technique based on intensity, and then visualise the thresholds in the histogram. So far I have written the following code:
import matplotlib.pyplot as plt
import numpy as np
from skimage import data, io, img_as_ubyte
from skimage.filters import threshold_multiotsu
# Read an image
image = io.imread("Fig_1.png")
# Apply multi-Otsu threshold
thresholds = threshold_multiotsu(image,classes=5)
# Digitize (segment) original image into multiple classes.
#np.digitize assigns values 0, 1, 2, 3, ... to pixels in each class.
regions = np.digitize(image, bins=thresholds)
output = img_as_ubyte(regions) #Convert 64 bit integer values to uint8
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(10, 3.5))
# Plotting the original image.
ax[0].imshow(image, cmap='gray')
ax[0].set_title('Original')
ax[0].axis('off')
# Plotting the histogram and the two thresholds obtained from
# multi-Otsu.
ax[1].hist(image.ravel(), bins=255)
ax[1].set_title('Histogram')
for thresh in thresholds:
    ax[1].axvline(thresh, color='r')
# Plotting the Multi Otsu result.
ax[2].imshow(regions, cmap='gray')
ax[2].set_title('Multi-Otsu result')
ax[2].axis('off')
plt.subplots_adjust()
plt.show()
This gives me the following result. As you can see, the Multi-Otsu result is totally black and does not show the two classes of objects present in the figure.
I chose classes=5, but since the image is bimodal I also tried classes=3, which gives the same result.
Any advice on how to correct this? Thanks in advance.
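Not a definitive answer, but one thing worth checking, assuming Fig_1.png loads with colour or alpha channels: io.imread on a PNG often returns an (H, W, 3) or (H, W, 4) array, and digitizing such an array produces a multi-channel label image whose small integer values render as essentially black. A small sketch of the check and of a grayscale conversion:
from skimage import io, color

image = io.imread("Fig_1.png")
print(image.shape, image.dtype)             # (H, W) means grayscale; (H, W, 3) or (H, W, 4) means colour/alpha

if image.ndim == 3:
    image = color.rgb2gray(image[..., :3])  # drop any alpha channel, then collapse to a single channel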
I'm trying to run a simple XGBoost prediction on Google Cloud ML Engine, following this example: https://cloud.google.com/ml-engine/docs/scikit/getting-predictions-xgboost#get_online_predictions
The model builds fine, but when I try to run a prediction with a sample input JSON it fails with the error "Could not initialize DMatrix from inputs: could not convert string to float:", as shown in the screenshot below. I understand this is happening because the test input contains strings; I was hoping the Google machine learning model would have the information needed to convert the categorical values to floats. I cannot expect my users to submit an online prediction request with float values.
Based on the tutorial, it should work without converting the categorical values to floats. Please advise; I have attached a GIF with more details. Thanks.
import json
import numpy as np
import os
import pandas as pd
import pickle
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
# these are the column labels from the census data files
COLUMNS = (
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'income-level'
)
# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
'workclass',
'education',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'native-country'
)
# load training set
with open('./census_data/adult.data', 'r') as train_data:
    raw_training_data = pd.read_csv(train_data, header=None, names=COLUMNS)
# remove column we are trying to predict ('income-level') from features list
train_features = raw_training_data.drop('income-level', axis=1)
# create training labels list
train_labels = (raw_training_data['income-level'] == ' >50K')
# load test set
with open('./census_data/adult.test', 'r') as test_data:
    raw_testing_data = pd.read_csv(test_data, names=COLUMNS, skiprows=1)
# remove column we are trying to predict ('income-level') from features list
test_features = raw_testing_data.drop('income-level', axis=1)
# create test labels list
test_labels = (raw_testing_data['income-level'] == ' >50K.')
# convert data in categorical columns to numerical values
encoders = {col:LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    train_features[col] = encoders[col].fit_transform(train_features[col])
for col in CATEGORICAL_COLUMNS:
    test_features[col] = encoders[col].fit_transform(test_features[col])
# load data into DMatrix object
dtrain = xgb.DMatrix(train_features, train_labels)
dtest = xgb.DMatrix(test_features)
# train XGBoost model
bst = xgb.train({}, dtrain, 20)
bst.save_model('./model.bst')
Here is a fix. Put the input shown in the Google documentation into a file input.json, then run the code below. The output is input_numerical.json, and prediction will succeed if you use that file in place of input.json.
This code simply preprocesses the categorical columns into numerical form, using the same procedure that was applied to the training and test data.
import json
import pandas as pd
from sklearn.preprocessing import LabelEncoder
COLUMNS = (
"age",
"workclass",
"fnlwgt",
"education",
"education-num",
"marital-status",
"occupation",
"relationship",
"race",
"sex",
"capital-gain",
"capital-loss",
"hours-per-week",
"native-country",
"income-level",
)
# categorical columns contain data that need to be turned into numerical
# values before being used by XGBoost
CATEGORICAL_COLUMNS = (
"workclass",
"education",
"marital-status",
"occupation",
"relationship",
"race",
"sex",
"native-country",
)
with open("./input.json", "r") as json_lines:
rows = [json.loads(line) for line in json_lines]
prediction_features = pd.DataFrame(rows, columns=(COLUMNS[:-1]))
encoders = {col: LabelEncoder() for col in CATEGORICAL_COLUMNS}
for col in CATEGORICAL_COLUMNS:
    prediction_features[col] = encoders[col].fit_transform(prediction_features[col])

with open("input_numerical.json", "w") as input_numerical:
    for index, row in prediction_features.iterrows():
        input_numerical.write(row.to_json(orient="values") + "\n")
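With input_numerical.json written, the online prediction call from the linked tutorial can be pointed at that file instead of input.json, along these lines (model and version names are placeholders):
gcloud ml-engine predict --model $MODEL_NAME --version $VERSION_NAME --json-instances input_numerical.json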
I created this Google Issue Tracker ticket, as the Google documentation is missing this important step.
You can use pandas to convert the categorical strings into codes for the model inputs. For the prediction input, you can define a dictionary for each categorical column that maps its category values to their codes. For example, for workclass:
df['workclass_cat'] = df['workclass'].astype('category')
df['workclass_cat'] = df['workclass_cat'].cat.codes
workclass_dict = dict(zip(list(df['workclass'].values), list(df['workclass_cat'].values)))
If a prediction input is 'somestring' you can access its code as follows:
category_input = workclass_dict['somestring']
XGBoost models take floats as input. In your training script you converted the categorical variables into numbers. The same transformation needs to be done when submitting a prediction.
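Putting the pieces above together, a small self-contained sketch (the column values here are made up for illustration) of building the mapping once and then applying it to a prediction input:
import pandas as pd

# toy training column; in practice this would be the full training data column
df = pd.DataFrame({'workclass': ['Private', 'Self-emp', 'Private', 'State-gov']})
df['workclass_cat'] = df['workclass'].astype('category')
df['workclass_cat'] = df['workclass_cat'].cat.codes
workclass_dict = dict(zip(list(df['workclass'].values), list(df['workclass_cat'].values)))

# at prediction time, map the raw string from the request to the code the model was trained on
category_input = workclass_dict['Private']
print(category_input)  # an integer code such as 0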
I'm interested in augmenting my dataset with random image transformations. I'm using Keras ImageDataGenerator, and I'm getting the following error when trying to apply random_transform to a single image:
--> x = apply_transform(x, transform_matrix, img_channel_axis, fill_mode, cval)
>>> RuntimeError: affine matrix has wrong number of rows.
I found the source code for the ImageDataGenerator here. However, I'm not sure how to debug the runtime error. Below is the code I have:
import numpy as np
from keras.preprocessing.image import img_to_array, load_img
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.inception_v3 import preprocess_input
image_path = './figures/zebra.jpg'
#data augmentation
train_datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')
print "\nloading image..."
image = load_img(image_path, target_size=(299, 299))
image = img_to_array(image)
image = np.expand_dims(image, axis=0) # 1 x input_shape
image = preprocess_input(image)
train_datagen.fit(image)
image = train_datagen.random_transform(image)
The error occurs at the last line when calling random_transform.
The problem is that random_transform expects a 3D array.
See the docstring:
def random_transform(self, x, seed=None):
    """Randomly augment a single image tensor.

    # Arguments
        x: 3D tensor, single image.
        seed: random seed.

    # Returns
        A randomly transformed version of the input (same shape).
    """
So you'll need to call it before np.expand_dims.
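Concretely, reordering the last few lines of the question along those lines (assuming the same imports and train_datagen as above) would look like this:
image = load_img(image_path, target_size=(299, 299))
image = img_to_array(image)                     # 3D tensor, shape (299, 299, 3)
image = train_datagen.random_transform(image)   # random_transform receives the 3D tensor
image = np.expand_dims(image, axis=0)           # add the batch axis afterwards
image = preprocess_input(image)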
I have 100 files that contain system call traces. Each file is presented as seen below:
setpgrp ioctl setpgrp ioctl ioctl ....
I am trying to load these files and perform a k-means calculation on them to cluster them based on similarity. Based on a tutorial on the sklearn webpage, I have written the following:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.datasets import load_files
from sklearn.cluster import KMeans, MiniBatchKMeans
import numpy as np
import sys
from optparse import OptionParser

# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
              dest="n_components", type="int",
              help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
action="store_false", dest="minibatch", default=True,
help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--use-idf",
action="store_false", dest="use_idf", default=True,
help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--n-features", type=int, default=10000,
help="Maximum number of features (dimensions)"
" to extract from text.")
op.add_option("--verbose",
action="store_true", dest="verbose", default=False,
help="Print progress reports inside k-means algorithm.")
print(__doc__)
op.print_help()
(opts, args) = op.parse_args()
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)
print("Loading training data:")
trainingdata = load_files('C:\data\Training data')
print("%d documents" % len(trainingdata.data))
print()
print("Extracting features from the training trainingdata using a sparse vectorizer")
if opts.use_idf:
    vectorizer = TfidfVectorizer(input="file", min_df=1)
X = vectorizer.fit_transform(trainingdata.data)
print("n_samples: %d, n_features: %d" % X.shape)
print()
if opts.n_components:
    print("Performing dimensionality reduction using LSA")
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(opts.n_components)
    lsa = make_pipeline(svd, Normalizer(copy=False))
    X = lsa.fit_transform(X)
    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))
    print()
However, it seems that none of the files in the dataset directory get loaded into memory, even though all the files are available. I get the following error when executing the program:
raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
Can anyone tell me why the dataset is not being loaded? What am I doing wrong?
I finally managed to load the files. The approach for using KMeans in sklearn is to vectorize the training data (with TfidfVectorizer or CountVectorizer), then transform your test data using the vectorizer fitted on the training data. Once that is done, you can initialize the KMeans parameters and use the training data vectors to create the k-means clusters. Finally, you can cluster your test data around your training data centroids.
The following code does what is explained above.
#Read the data in a directory:
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def readfile(dataDir):
    data_set = []
    for file in os.listdir(dataDir):
        trainingfiles = os.path.join(dataDir, file)
        if os.path.isfile(trainingfiles):
            data = open(trainingfiles, 'r')
            dataread = str.decode(data.read())
            data_set.append(dataread)
    return data_set
#fitting tfidf transform for training data
tfidf_vectorizer = TfidfVectorizer(min_df=1)  # vectorizer instance (construction not shown in the original snippet)
tfidf_vectorizer_trainingset = tfidf_vectorizer.fit_transform(readfile(trainingdataDir)).toarray()
#transform the test set based on the training set
tfidf_vectorizer_testset = tfidf_vectorizer.transform(readfile(testingdataDir)).toarray()
# Kmean Clustering parameters
kmean_parameters = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=1)
#Cluster the training data based on the parameters
KmeanAnalysis_training = kmean_parameters.fit(tfidf_vectorizer_trainingset)
#transform the test data based on the clustering of the training data
KmeanAnalysis_test = kmean_parameters.transform(tfidf_vectorizer_testset)
I am trying to identify the type of noise based on this article:
Model selection with Probabilistic (PCA) and Factor Analysis (FA)
I am using scikit-learn-0.14.1.win32-py2.7 on win8 64bit
I know that it refers to version 0.15; however, the version 0.14 documentation mentions that the score method is available for PCA, so I guess it should normally work:
sklearn.decomposition.ProbabilisticPCA
The problem is that no matter which PCA I use with cross_val_score, I always get a TypeError saying that the PCA estimator does not have a score method:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator PCA(copy=True, n_components=None, whiten=False) does not.
Any ideas why is that happening?
Many thanks in advance
Christos
X has 1000 samples with 40 features each.
Here is a portion of the code:
import numpy as np
import csv
from scipy import linalg
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.cross_validation import cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.covariance import ShrunkCovariance, LedoitWolf
#read in the training data
train_path = '<train data path>/train.csv'
reader = csv.reader(open(train_path,"rb"),delimiter=',')
train = list(reader)
X = np.array(train).astype('float')
n_samples = 1000
n_features = 40
n_components = np.arange(0, n_features, 4)
def compute_scores(X):
    pca = PCA()
    pca_scores = []
    for n in n_components:
        pca.n_components = n
        pca_scores.append(np.mean(cross_val_score(pca, X, n_jobs=1)))
    return pca_scores
pca_scores = compute_scores(X)
n_components_pca = n_components[np.argmax(pca_scores)]
OK, I think I found the problem. It does not work with PCA, but it does work with ProbabilisticPCA.
However, by not providing a cv number, cross_val_score automatically uses 3-fold cross-validation, which created 3 folds of sizes 334, 333 and 333 (my initial training set contains 1000 samples).
Since numpy.mean cannot combine score arrays of different sizes (334 vs 333), Python raises an exception.
thx
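For reference, a sketch of that variant, assuming scikit-learn 0.14 where ProbabilisticPCA is still available; passing cv=5 keeps every fold the same size (200 samples each out of 1000), which sidesteps the unequal-size issue described above:
from sklearn.decomposition import ProbabilisticPCA
from sklearn.cross_validation import cross_val_score
import numpy as np

def compute_ppca_scores(X):
    ppca = ProbabilisticPCA()
    ppca_scores = []
    for n in n_components:
        ppca.n_components = n
        # ProbabilisticPCA exposes a score method, so cross_val_score works without a scoring argument
        ppca_scores.append(np.mean(cross_val_score(ppca, X, cv=5, n_jobs=1)))
    return ppca_scores

ppca_scores = compute_ppca_scores(X)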