numpy.where returning empty index - python-2.7

So I'm trying to create two arrays using numpy. One array is a lot bigger than the other, so I want to search the large array to see where each element of my small array is located (i.e. what index). However, when I run the code below, one of the elements in the small array cannot be found and I'm not sure why. Is it a data type mismatch?
Please advise, thank you!
import matplotlib.pyplot as plt
import numpy as np
GMean = np.array([4.23, 4.93, 5.67, 6.62, 4.67])
conc_x = np.arange(0.0, 90, 0.1)
GMean = np.round(GMean, decimals=1)
for i in np.nditer(GMean):
    spec_index = np.where(conc_x == i)  # look for the index in conc_x where our GMean data point lies
    print i
    print spec_index
console output:
4.2
(array([42]),)
4.9
(array([49]),)
5.7
(array([57]),)
6.6
(array([], dtype=int32),) #why can it not find the index here?
4.7
(array([47]),)

So using numpy.around() instead of numpy.round() appeared to work; I got an index every time.
I thought they were the same, and in fact they are: in NumPy, round() is simply an alias for around(), so the subtle difference in the docstrings ("Round an array to the given number of decimals." vs. "Evenly round to the given number of decimals.") does not describe a real behavioural difference.
The actual cause of the empty result is floating-point representation: np.arange(0.0, 90, 0.1) accumulates values such as 6.6000000000000005 rather than exactly 6.6, while np.round(6.62, 1) produces the nearest double to 6.6, so the exact == comparison fails for that element. The robust fix is to compare with a tolerance rather than with exact equality, e.g. using np.isclose.
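For example, a tolerance-based lookup along these lines (a sketch, not the original code) finds an index for every value:

import numpy as np

GMean = np.array([4.23, 4.93, 5.67, 6.62, 4.67])
conc_x = np.arange(0.0, 90, 0.1)
GMean = np.round(GMean, decimals=1)

for i in np.nditer(GMean):
    # compare within a small tolerance instead of requiring exact equality
    spec_index = np.where(np.isclose(conc_x, i))
    print i
    print spec_index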
Hope this makes sense.

Related

NP where and if statement conditions

I am currently having trouble understanding np.where in relation to if statements (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.where.html). I have heard that it is more efficient. I am trying to get a better understanding of the np.where function, but I haven't found any examples that make it clear. This is the if statement I really want to convert; can anyone assist? When you convert it, could you give some more conditional examples as well? Would other NumPy conditional constructs work better if np.where isn't a viable solution?
For the sake of a better explanation, true_list and flat_list aren't really lists in the real problem; they're arrays I'm appending to. So I'm just going to rename them true_arr and flat_arr.
Here's more code:
combo = list(element)
flat_arr = np.concatenate(combo)  # changes array dimensions to what I need
sum_flat_arr = flat_arr.sum(axis=0)
salary = sum_flat_arr[2]
values = sum_flat_arr[3]
if salary <= 5000 and values > 150:
    true_arr = true_arr + flat_arr
true_arr is just an empty numpy array (I'm not sure of the best way to handle it: prefill it with the right number of empty rows and columns, or leave it completely blank and just append to it).
flat_arr is just one whole array; it looks like:
Out:
[['Johnny Tsunami' 'Driver' 1000 39]
['Snow White' 'Pistol' 2000 40]
['Michael B. Jackson' 'Pistol' 2500 46]
['Greg Ritcher' 'Lookout' 200 25]]
Essentially Name, Job, Salary and Value. Instead of dataframes I'm trying to do everything in NumPy for speed. The reason I'm not using np.concatenate is that I hear it's slower than appending like a list. If I'm wrong, please explain.
It's just appending a list. If it can't be done this way and it needs to be a function, it could be np.append or np.concatenate.
If none of this applies and I have been thinking about it the wrong way entirely, I'm simply looking for a NumPy way to do if statements more efficiently (faster).
Can someone point me in the right direction?
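As an illustrative sketch (not from the thread), the scalar if maps onto a boolean mask over whole columns, assuming the per-combination sums are stacked into a 2-D numeric array (the data below is invented):

import numpy as np

# hypothetical: each row holds the summed [.., .., salary, values] for one combo
sums = np.array([[0, 0, 4800, 160],
                 [0, 0, 5200, 170],
                 [0, 0, 3000, 140]])

salary = sums[:, 2]
values = sums[:, 3]

mask = (salary <= 5000) & (values > 150)  # vectorized form of the if condition
true_arr = sums[mask]                     # keeps only the rows that pass
print true_arr

np.where(mask) would give you the indices of the passing rows instead of the rows themselves.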

Python 2.7, trying to find unique numbers from list of integers

I have to find the unique integers in a list of integers, Node1ID, containing over 80,000 values. I have the following code that works; however, it is very slow and takes over 5 minutes to execute. I'm looking for a faster way. Can somebody help?
Here is my code:
output = []
for x in Node1ID:
    if x not in output:
        output.append(x)
Thanks
Create a set from the list, then convert that back to a list.
output = list(set(Node1ID))
Sets cannot contain duplicate elements, so the first conversion gets rid of all the duplicates.
Never mind, looks like I just got it. The answer is the following:
import numpy as np
unique = np.unique(Node1ID)
It executes in less than 0.1 seconds. Thanks!
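For reference, the original loop is slow because x not in output is a linear scan, making the whole thing quadratic; both replacements avoid that. A small sketch (invented data) contrasting the two; note that np.unique returns a sorted array, while a set carries no order guarantee:

import numpy as np

Node1ID = [7, 3, 7, 1, 3, 1]  # stand-in for the real 80,000-element list

print list(set(Node1ID))      # duplicates removed, order arbitrary
print np.unique(Node1ID)      # sorted ndarray: [1 3 7]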

How to add interrupted time series in matplotlib

Since I am just a Python beginner, I hope I'm not bothering you too much with my question. I'd like to plot only parts of a time series from a dataset such as this:
Code:
import numpy as np
import matplotlib.pyplot as plt
a=np.arange(10)
b=np.arange(100, 105)
c=np.arange(30, 40)
d=np.arange(55, 60)
e=np.append(a,b)
f=np.append(c,d)
plt.plot(e,f)
Then I get a plot with a long diagonal line between the two data series. What can I do to get rid of that? In other words, I want the x-axis to show only 0,1,2,3,4,5,6,7,8,9,100,101,102,103,104 and the corresponding y-values, and nothing in between. Moreover, if I have a very long time series (e.g. from an oscilloscope measurement), is it possible to indicate that some patterns are repeated, e.g. with big dots in the middle of the plot? I've tried converting the x data to strings and several other things, but it doesn't seem to work.
Focusing on this part of your question:
Then I get a plot with a long diagonal line between the data series. What to do to get rid of that [...]
It sounds like a scatter plot should meet your needs, so try adding this to your code:
fig, ax = plt.subplots()
ax.scatter(e, f)
for i, txt in enumerate(e):
    ax.annotate(txt, (e[i], f[i]))
fig.show()
This gives a scatter plot with each point labelled and no connecting line.
Unfortunately, I'm not quite sure what you're trying to accomplish with this part:
Moreover if I have a very long time series (e.g. from an oscilloscope measurement)
Anyway, I hope my suggestion at least fixes the main part of your problem.
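As a side note (my own sketch, not part of the original answer), if you'd rather keep a line plot there are two common tricks: inserting a NaN between the segments makes matplotlib break the line at the gap, and plotting against the sample index with relabelled ticks collapses the empty stretch of x-axis entirely:

import numpy as np
import matplotlib.pyplot as plt

a = np.arange(10)
b = np.arange(100, 105)
c = np.arange(30, 40)
d = np.arange(55, 60)

# option 1: break the line at the gap with a NaN (x spacing stays numeric)
e = np.concatenate([a, [np.nan], b])
f = np.concatenate([c, [np.nan], d])
plt.plot(e, f)

# option 2: collapse the gap by plotting against the index and relabelling ticks
x = np.append(a, b)
y = np.append(c, d)
idx = np.arange(len(x))
plt.figure()
plt.plot(idx, y)
plt.xticks(idx, x)

plt.show()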

Different types of features to train Naive Bayes in Python Pandas

I would like to use a number of features to train a Naive Bayes classifier to classify 'A' or 'non-A'.
I have three features of different value types:
1) total_length - a positive integer
2) vowel_ratio - a decimal/fraction
3) twoLetters_lastName - an array containing multiple two-letter strings
# coding=utf-8
from nltk.corpus import names
import nltk
import random
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from sklearn.naive_bayes import GaussianNB
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Import data into pandas
data = pd.read_csv('XYZ.csv', header=0, encoding='utf-8',
                   low_memory=False)
df = DataFrame(data)
# Randomize records
df = df.reindex(np.random.permutation(df.index))
# Assign column into label Y
df_Y = df[df.AScan.notnull()][['AScan']].values # Labels are 'A' or 'non-A'
#print df_Y
# Assign column vector into attribute X
df_X = df[df.AScan.notnull()][['total_length', 'vowel_ratio', 'twoLetters_lastName']].values
#print df_X[0:10]
# Incorporate X and Y into ML algorithms
clf = GaussianNB()
clf.fit(df_X, df_Y)
df_Y is as follows:
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
df_X is below:
[[9L 0.222222222 u"[u'ke', u'el', u'll', u'ly']"]
[17L 0.41176470600000004
u"[u'ma', u'ar', u'rg', u'ga', u'ar', u'ri', u'is']"]
[11L 0.454545455 u"[u'du', u'ub', u'bu', u'uc']"]
[11L 0.454545455 u"[u'ma', u'ah', u'he', u'er']"]
[15L 0.333333333 u"[u'ma', u'ag', u'ge', u'ee']"]
[13L 0.307692308 u"[u'jo', u'on', u'ne', u'es']"]
[12L 0.41666666700000005
u"[u'le', u'ef', u'f\\xe8', u'\\xe8v', u'vr', u're']"]
[15L 0.26666666699999997 u"[u'ni', u'ib', u'bl', u'le', u'et', u'tt']"]
[15L 0.333333333 u"[u'ki', u'in', u'ns', u'sa', u'al', u'll', u'la']"]
[11L 0.363636364 u"[u'mc', u'cn', u'ne', u'ei', u'il']"]]
I am getting this error:
E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py:150: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Traceback (most recent call last):
File "C:werwer\wer\wer.py", line 32, in <module>
clf.fit(df_X, df_Y)
File "E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py", line 163, in fit
self.theta_[i, :] = np.mean(Xi, axis=0)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2727, in mean
out=out, keepdims=keepdims)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\_methods.py", line 69, in _mean
ret, rcount, out=ret, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
My understanding is that I need to convert the features into one numpy array as a feature vector, but I don't think I am preparing this X vector correctly, since it contains very different value types.
Related questions: Choosing a Classification Algorithm to Classify Mix of Nominal and Numeric Data -- Mixing Categorial and Continuous Data in Naive Bayes Classifier Using Scikit-learn
Okay so there are a few things going on. As DalekSec pointed out, it's best practice to keep all your features as one type as you input them into a model like GaussianNB. The traceback indicates that while fitting the model, it tries to divide a string (presumably one of your unicode strings like u"[u'ke', u'el', u'll', u'ly']") by an integer. So what we need to do is convert the training data into a form that sklearn can use. We can do this a few ways, two of which ogrisel eloquently describes in this answer here.
We can convert all the continuous variables to categorical variables. In our case, this means converting total_length (in some cases you could probably treat this as a categorical variable, but let's not get ahead of ourselves) and vowel_ratio. For instance, you can basically bin the values you see in each feature into one of 5 values based on percentile: 'very small', 'small', 'medium', 'high', 'very high'. There's no really easy way to do this in sklearn as far as I know, but it should be pretty straightforward to do it yourself. The only thing you would want to change is to use MultinomialNB instead of GaussianNB, because you'll be dealing with features that are better described by multinomial distributions than by Gaussian ones.
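A rough sketch of that binning idea (my own illustration, not code from the thread; it reuses the df from the question):

import numpy as np

vowel_ratio = df['vowel_ratio'].values.astype(float)

# bin edges at the 20/40/60/80th percentiles -> five percentile bins
edges = np.percentile(vowel_ratio, [20, 40, 60, 80])
binned = np.digitize(vowel_ratio, edges)  # integers 0..4, 'very small'..'very high'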
We can convert the categorical features to numeric ones for use with GaussianNB. Personally I find this the more intuitive approach. Basically, when dealing with text, you need to figure out what information you want to take from the text and pass to the classifier. It looks to me like you want to extract the incidence of the different two-letter last-name fragments.
Normally I would ask whether you have all the last-name fragments in your dataset, but since each one is only two letters we can just store all the possible two-letter combinations (including the unicode characters involving accent marks) with a minimal impact on performance. This is where something like sklearn's CountVectorizer might be useful. Assuming that you have every possible combination of two letters in your data, you can directly use it to turn a row of your twoLetters_lastName column into an N-dimensional vector that records the number of occurrences of each unique pair in that row. Then just combine this new vector with your other two features into a numpy array.
In the case that you do not have every possible combination of two letters (including accented ones), you should consider generating that list and passing it in as the 'vocabulary' argument to the CountVectorizer, so that your classifier knows how to handle all possible fragments. It's not the end of the world if you don't handle all cases, but any new unseen two-letter pairs will simply be ignored in this scheme.
Before you use these tools, you should make sure that you pass your last name column in as a list, and not as a string, as this can result in unintended behavior.
You can read more about general sklearn preprocessing here, and more about CountVectorizer and other text feature extraction tools provided by sklearn here. I use a lot of these tools daily, and recommend them for basic text extraction tasks. There are also plenty of tutorials and demos available online. You might also look for other types of methods of representation, like binarizing and one-hot encoding. There are many ways to solve this problem, it mostly depends on your specific problem/needs.
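A hedged sketch of that second approach (my own illustration, assuming the column names from the question; ast.literal_eval is used because, as noted above, each twoLetters_lastName cell arrives as the string representation of a list):

import ast
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

rows = df[df.AScan.notnull()]

# parse u"[u'ke', u'el', ...]" strings back into lists, then join each into
# one space-separated "document" for CountVectorizer
docs = [' '.join(ast.literal_eval(s)) for s in rows['twoLetters_lastName']]

vec = CountVectorizer(token_pattern=r'\S+')  # count every two-letter token
pair_counts = vec.fit_transform(docs).toarray()

numeric = rows[['total_length', 'vowel_ratio']].values.astype(float)
X = np.hstack([numeric, pair_counts])        # one all-numeric feature matrix

y = rows['AScan'].values.ravel()             # flat 1-D labels, per the warning
clf = GaussianNB().fit(X, y)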
After you're able to turn all your data into one form or the other, you should be able to make use of either the Gaussian or Multinomial NB classifier. As for your error regarding the 1D vector, you printed df_Y and it looked like
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
Basically, it's expecting this to be in a flat list, rather than as a column vector (a list of one-dimensional lists). Just reshape it accordingly by making use of commands like numpy.reshape() or numpy.ravel() (numpy.ravel() would probably be more appropriate, considering that you're dealing with just one column, as the error mentioned).
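For example, something as simple as this (using the variable names from the question) addresses that warning:

clf.fit(df_X, df_Y.ravel())  # flattens (n_samples, 1) into (n_samples,)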
I'm not 100% sure, but I think sklearn.naive_bayes requires a purely numeric feature vector instead of a mixture of text and numbers. It looks like it crashes when trying to "divide" a unicode string by a long integer.
I can't be much help with finding numeric representations for text, but this scikit-learn tutorial might be a good start.

Xlrd list index out of range

I'm just starting to explore xlrd, and to be honest I'm pretty new to programming altogether. I've been working through some of their simple examples and can't get this simple piece of code to work:
import xlrd
book = xlrd.open_workbook('C:\\Users\\M\\Documents\\trial.xlsx')
sheet = book.sheet_by_index(1)
cell = sheet.cell(0, 0)
print cell
I get an error: list index out of range (referring to the 2nd-to-last line of code).
I cut and pasted most of the code from the PDF... any help?
You say:
I get an error: list index out of range (referring to the 2nd to last bit of code)
I doubt it. How many sheets are there in the file? I suspect that there is only one sheet. Indexing in Python starts from 0, not 1. Please edit your question to show the full traceback and the full error message. I suspect that it will show that the IndexError occurs in the 3rd-last line:
sheet = book.sheet_by_index(1)
I would play around with it in the console.
Execute each statement one at a time and then view the result of each. The sheet indexes count from 0, so if you only have one worksheet then you're asking for the second one, and that will give you a list index out of range error.
Another thing you might be missing is that not all cells exist if they don't have data in them. Some do, but some don't. Basically, the cells that exist from xlrd's standpoint are the ones inside the nrows x ncols matrix.
Another thing is that if you actually want the values out of the cells, use the cell_value method. That will return you either a string or a float.
Side note: you could write your path like so: 'C:/Users/M/Documents/trial.xlsx'. Python handles the / vs \ difference on Windows behind the scenes, so you won't have to screw around with escape characters.
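Putting the answers together, a minimal sketch (path copied from the question) for inspecting a workbook with xlrd:

import xlrd

book = xlrd.open_workbook('C:/Users/M/Documents/trial.xlsx')
print book.nsheets              # how many sheets the file actually has
sheet = book.sheet_by_index(0)  # indexes start at 0, so this is the first sheet
print sheet.nrows, sheet.ncols  # cells only "exist" inside this matrix
print sheet.cell_value(0, 0)    # just the value: a string or a float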