Divide the testing set into subgroups, then make predictions on each subgroup separately - python-2.7

I have a dataset similar to the following table:
The prediction target is the 'score' column. I'm wondering how I can divide the testing set into different subgroups, such as scores between 1 and 3, and then check the accuracy on each subgroup.
Now what I have is as follows:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = tree.DecisionTreeRegressor()
model.fit(X_train, y_train)
for i in (0,1,2,3,4):
    y_new = y_test[(y_test>=i) & (y_test<=i+1)]
    y_new_pred = model.predict(X_test)
    print metrics.r2_score(y_new, y_new_pred)
However, my code does not work; this is the error that I get:
Found input variables with inconsistent numbers of samples: [14279, 55955]
I have tried the solution provided below, but it looks like for the full score range (0-5) the r² is 0.67, while for the subscore ranges (0-1, 1-2, 2-3, 3-4, 4-5) the r² values are significantly lower than that of the full range. Shouldn't some of the subscore r² values be higher than 0.67 and some of them lower?
Could anyone kindly let me know where I went wrong? Thanks a lot for all your help.

When you are computing the metrics, you have to filter the predicted values as well (based on your subset condition).
Basically you are trying to compute
metrics.r2_score([1,3],[1,2,3,4,5])
which raises an error:
ValueError: Found input variables with inconsistent numbers of samples: [2, 5]
Hence, my suggested solution would be
model.fit(X_train, y_train)
# compute the prediction only once
y_pred = model.predict(X_test)
for i in (0,1,2,3,4):
    # compute the condition for the subset here
    subset = (y_test>=i) & (y_test<=i+1)
    print metrics.r2_score(y_test[subset], y_pred[subset])

Related

How to set the test size in stratified k-fold sampling in Python?

Using sklearn, I want to have 3 splits (i.e. n_splits = 3) in the sample dataset and a train/test ratio of 70:30. I'm able to split the set into 3 folds but not able to define the test size (similar to the train_test_split method). Is there a way to define the test sample size in StratifiedKFold?
from sklearn.model_selection import StratifiedKFold as SKF
skf = SKF(n_splits=3)
skf.get_n_splits(X, y)
for train_index, test_index in skf.split(X, y):
    # loops over 3 iterations to get a stratified train/test split
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
StratifiedKFold does, by definition, a K-fold split: the returned iterator yields K-1 folds for training and 1 fold for testing. K is controlled by n_splits, so it creates groups of n_samples/K and uses every combination of K-1 folds for training with the remaining fold for testing. Refer to Wikipedia or google K-fold cross-validation for more info.
In short, the size of the test set will be 1/K (i.e. 1/n_splits), so you can tune that parameter to control the test size (e.g. n_splits=3 will give a test split of size 1/3 = 33% of your data). However, StratifiedKFold will iterate over all K train/test combinations, which might not be what you want.
Having said that, you might be interested in StratifiedShuffleSplit, which returns a configurable number of splits with a configurable train/test ratio. If you just want a single split, you can set n_splits=1 and keep test_size=0.3 (or whatever ratio you want).
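A minimal sketch of that approach (assuming X and y are the same arrays used above; n_splits=1 and test_size=0.3 give a single stratified 70:30 split):
from sklearn.model_selection import StratifiedShuffleSplit

# one stratified split with a 70:30 train/test ratio
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]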

Determining log_perplexity using ldamulticore for optimum number of topics

I am trying to determine the optimum number of topics for my LDA model using log perplexity in python. That is, I am graphing the log perplexity for a range of topics and determining the minimum perplexity. However, the graph I have obtained has negative values for log perplexity, when it should have positive values between 0 and 1.
#calculating the log perplexity per word as obtained by gensim code
##https://radimrehurek.com/gensim/models/atmodel.html
#parameters: pass in trained corpus
#return: graph of perplexity per word for varying number of topics
parameter_list = range(1, 500, 100)
grid = {}
for parameter_value in parameter_list:
    model = models.LdaMulticore(corpus=corpus, workers=None, id2word=None,
                                num_topics=parameter_value, iterations=10)
    grid[parameter_value] = []
    perplex = model.log_perplexity(corpus, total_docs=len(corpus))
    grid[parameter_value].append(perplex)
df = pd.DataFrame(grid)
ax = plt.figure(figsize=(7, 4), dpi=300).add_subplot(111)
df.iloc[0].transpose().plot(ax=ax, color="#254F09")
plt.xlim(parameter_list[0], parameter_list[-1])
plt.ylabel('Perplexity')
plt.xlabel('topics')
plt.show()
The per-word likelihood bound is a probability-like quantity, and hence lies between 0 and 1. What you are computing is its logarithm, the log-perplexity, which is negative because the logarithm of a number in the (0, 1) range is below zero.
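If you would rather plot an actual perplexity (a positive number where lower is better), you can convert the value; this is a minimal sketch, assuming gensim's bound is the base-2 per-word log likelihood that its logging output reports as perplexity = 2^(-bound):
# perplex is the value returned by model.log_perplexity(...) above;
# the corresponding per-word perplexity is 2 ** (-bound)
perplexity_per_word = 2 ** (-perplex)
grid[parameter_value].append(perplexity_per_word)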

Can I use a neural network on a linear regression using Keras? If yes , How?

I'm having difficulties setting up a NN in Keras. Please help me!
This is my code, and I'm getting random values every time I predict.
model = Sequential()
layer1 = Dense(5, input_shape = (5,))
model.add(layer1)
model.add(Activation('relu'))
layer2 = Dense(1)
model.add(layer2)
model.add(Activation('relu'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(xtrain, ytrain, verbose=1)
I have 5 input features and want to predict a single continuous value as the output.
The problem was that I was getting random predictions for the same input. I have now reached the solution: it was happening simply because I was not normalising the features.
Thanks
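A minimal sketch of that fix (the StandardScaler and the epoch count are illustrative choices, assuming xtrain and ytrain are the arrays used above; the final relu is also dropped so the output layer is linear, the usual choice for a continuous target):
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Activation

# scale the five input features to zero mean and unit variance
scaler = StandardScaler()
xtrain_scaled = scaler.fit_transform(xtrain)

model = Sequential()
model.add(Dense(5, input_shape=(5,)))
model.add(Activation('relu'))
model.add(Dense(1))  # no activation: a linear output suits a continuous target
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(xtrain_scaled, ytrain, epochs=50, verbose=1)  # epoch count is illustrative
Remember to apply the same fitted scaler to any test data with scaler.transform before calling model.predict.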
From my point of view, you are not giving your input shape correctly:
layer1 = Dense(5, input_shape = (5,))
What is your actual input shape?

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm
But I am receiving the error,
File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'
I have already performed Latent Dirichlet Allocation for the data I have and I have generated the unigrams and their respective probabilities (they are normalized as the sum of total probabilities of the data is 1).
My unigrams and their probability looks like:
Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781
This is just a fragment of the unigrams file I have. The same format is followed for about 1000s of lines. The total probabilities (second column) summed gives 1.
I am a budding programmer. This ngram.py belongs to the nltk package and I am confused as to how to rectify this. The sample code I have here is from the nltk documentation and I don't know what to do now. Please help on what I can do. Thanks in advance!
Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:
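For a test set W = w_1 w_2 ... w_N scored with unigram probabilities P(w_i), the formula being referred to is the standard one:
PP(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i)} \right)^{1/N}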
Now you say you have already constructed the unigram model, meaning, for each word you have the relevant probability. Then you only need to apply the formula. I assume you have a big dictionary unigram[word] that would provide the probability of each word in the corpus. You also need to have a test set. If your unigram model is not in the form of a dictionary, tell me what data structure you have used, so I could adapt it to my solution accordingly.
perplexity = 1
N = 0
for word in testset:
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
perplexity = pow(perplexity, 1/float(N))
UPDATE:
As you asked for a complete working example, here's a very simple one.
Suppose this is our corpus:
corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""
Here's how we construct the unigram model first:
import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)
#here you construct the unigram language model
def unigram(tokens):
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        try:
            model[f] += 1
        except KeyError:
            model[f] = 1
            continue
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model
Our model here is smoothed. For words outside the scope of its knowledge, it assigns a low probability of 0.01. I already told you how to compute perplexity:
# computes perplexity of the unigram model on a testset
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N))
    return perplexity
Now we can test this on two different test sets:
testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"
model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)
for which you get the following result:
>>>
49.09452736318415
99.99999999999997
Note that when dealing with perplexity, we try to reduce it. A language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty was included in the unigram model, so the perplexity was correspondingly smaller.
Thanks for the code snippet! Shouldn't:
for word in model:
    model[word] = model[word]/float(sum(model.values()))
be rather:
v = float(sum(model.values()))
for word in model:
    model[word] = model[word]/v
Oh ... I see it was already answered ...

How does scikit's cross validation work?

I have the following snippet:
print '\nfitting'
rfr = RandomForestRegressor(
    n_estimators=10,
    max_features='auto',
    criterion='mse',
    max_depth=None,
)
rfr.fit(X_train, y_train)
# scores
scores = cross_val_score(
    estimator=rfr,
    X=X_test,
    y=y_test,
    verbose=1,
    cv=10,
    n_jobs=4,
)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
1) Does running the cross_val_score do more training on the regressor?
2) Do I need to pass in a trained regressor or just a new one, e.g. estimator=RandomForestRegressor()? How then do I test the accuracy of a regressor, i.e. must I use another function in scikit?
3) My accuracy is about 2%. Is that the MSE score, where lower is better, or is it the actual accuracy? If it is the actual accuracy, can you explain it, because it doesn't make sense how a regressor would predict accurately over a continuous range.
1) It re-trains the estimator, k times in fact.
2) Untrained (or trained, but then the model is deleted and you're just wasting time).
3) It's the R² score, so that's not actually 2% but .02; R² is capped at 1 but can be negative. Accuracy is not well-defined for regression. (You can define it as for classification, but that makes no sense.)
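A minimal sketch of the intended usage (pass a fresh, untrained estimator together with the full data; X and y here are assumed to be the complete feature matrix and target):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# a fresh estimator; cross_val_score clones and re-fits it for each of the 10 folds
rfr = RandomForestRegressor(n_estimators=10)
scores = cross_val_score(rfr, X, y, cv=10, scoring='r2')  # R^2 is also the default for regressors
print "R^2: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)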