I'm creating two arrays using numpy. One array is much bigger than the other, and I want to search the large array to find where each element of my small array is located (i.e. at what index). However, when I run the code below, one of the elements in the small array cannot be found and I'm not sure why. Is it a data type mismatch?
Please advise, thank you!
import matplotlib.pyplot as plt
import numpy as np
GMean = np.array([4.23, 4.93, 5.67, 6.62, 4.67])
conc_x = np.arange(0.0, 90, 0.1)
GMean = np.round(GMean, decimals=1)
for i in np.nditer(GMean):
    spec_index = np.where(conc_x == i)  # look for the index in conc_x where our GMean data point lies
    print i
    print spec_index
console output:
4.2
(array([42]),)
4.9
(array([49]),)
5.7
(array([57]),)
6.6
(array([], dtype=int32),) #why can it not find the index here?
4.7
(array([47]),)
So using numpy.around() instead of numpy.round() works. I get an index every time.
I thought they were the same but looking at the documentation, there is a subtle difference:
"Round an array to the given number of decimals."
vs:
"Evenly round to the given number of decimals."
So I'm thinking "evenly round" means it rounds all trailing digits beyond the desired decimal, so that both numbers you are comparing become exactly the same.
Hope this makes sense.
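A likely cause, whichever rounding function is used, is floating-point representation: np.arange builds conc_x from multiples of 0.1, and some of those multiples (66 * 0.1, for instance) are not bit-for-bit equal to the rounded value 6.6, so == finds nothing. A minimal sketch of a tolerance-based lookup with np.isclose, left at its default tolerances (an assumption that suits a grid spaced 0.1 apart), might look like this:

```python
import numpy as np

GMean = np.round(np.array([4.23, 4.93, 5.67, 6.62, 4.67]), decimals=1)
conc_x = np.arange(0.0, 90, 0.1)

# Compare with a tolerance instead of exact equality, then take the
# first (and here only) matching position for each rounded value.
indices = [int(np.flatnonzero(np.isclose(conc_x, value))[0]) for value in GMean]
print(indices)
```

Because neighbouring entries of conc_x differ by 0.1, the default tolerance is far too small to match more than one entry per value.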
I have to find the unique integers in a list of integers, Node1ID, containing over 80000 values. The following code works, but it is very slow: it takes over 5 minutes to execute. I'm looking for a faster way. Can somebody help?
Here is my code:
output = []
for x in Node1ID:
    if x not in output:
        output.append(x)
Thanks
Create a set from the list, then convert that back to a list.
output = list(set(Node1ID))
Sets cannot contain duplicate elements, so the first conversion gets rid of all the duplicates.
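One caveat: a set does not keep the original order of Node1ID, whereas the loop in the question does. If the order matters, a sketch that tracks already-seen values in a set keeps the fast membership test without reordering. The function name unique_in_order and the sample data are made up for illustration:

```python
def unique_in_order(values):
    """Return the distinct items of values in first-seen order."""
    seen = set()
    output = []
    for x in values:
        if x not in seen:  # set lookup is O(1) on average, unlike a list scan
            seen.add(x)
            output.append(x)
    return output


Node1ID = [7, 3, 7, 1, 3, 9]  # small stand-in for the real list
output = unique_in_order(Node1ID)
print(output)
```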
Never mind, looks like I just got it.
The answer is the following:
import numpy as np
unique = np.unique(Node1ID)
It executes in less than 0.1 seconds.
thanks
I am trying to run this in quantlib-python:
import QuantLib as ql
date = ql.Date(31, 3, 2015)
date
returning: Date(31,3,2015)
but it is supposed to return: March 31st, 2015
I am new to quantlib-python. What am I missing? Thank you.
I am using VC2015/quantlib 1.8/quantlib-swig-1.8
When you only write date, your interpreter calls the __repr__ method and displays the result. If you say print date, instead, it calls the __str__ method, which is what you're looking for. The two methods have different purposes in Python (see e.g. Purpose of Python's __repr__) and are often implemented differently; you can try the same thing with datetime.date and see what happens.
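The datetime.date comparison suggested above can be sketched as follows. It uses only the standard library, so it illustrates the __repr__/__str__ split in general rather than QuantLib's own output:

```python
import datetime

d = datetime.date(2015, 3, 31)

as_repr = repr(d)  # what the interactive prompt shows for a bare `d`
as_str = str(d)    # what print shows

print(as_repr)
print(as_str)
```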
I would like to use a number of features to train with Naive Bayes classifier to classify 'A' or 'non-A'.
I have three features of different value types:
1) total_length - in positive integer
2) vowel-ratio - in decimal/fraction
3) twoLetters_lastName - an array containing multiple two-letter strings
# coding=utf-8
from nltk.corpus import names
import nltk
import random
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from sklearn.naive_bayes import GaussianNB
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
# Import data into pandas
data = pd.read_csv('XYZ.csv', header=0, encoding='utf-8',
low_memory=False)
df = DataFrame(data)
# Randomize records
df = df.reindex(np.random.permutation(df.index))
# Assign column into label Y
df_Y = df[df.AScan.notnull()][['AScan']].values # Labels are 'A' or 'non-A'
#print df_Y
# Assign column vector into attribute X
df_X = df[df.AScan.notnull()][['total_length', 'vowel_ratio', 'twoLetters_lastName']].values
#print df_X[0:10]
# Incorporate X and Y into ML algorithms
clf = GaussianNB()
clf.fit(df_X, df_Y)
df_Y is as follows:
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
df_X is below:
[[9L 0.222222222 u"[u'ke', u'el', u'll', u'ly']"]
[17L 0.41176470600000004
u"[u'ma', u'ar', u'rg', u'ga', u'ar', u'ri', u'is']"]
[11L 0.454545455 u"[u'du', u'ub', u'bu', u'uc']"]
[11L 0.454545455 u"[u'ma', u'ah', u'he', u'er']"]
[15L 0.333333333 u"[u'ma', u'ag', u'ge', u'ee']"]
[13L 0.307692308 u"[u'jo', u'on', u'ne', u'es']"]
[12L 0.41666666700000005
u"[u'le', u'ef', u'f\\xe8', u'\\xe8v', u'vr', u're']"]
[15L 0.26666666699999997 u"[u'ni', u'ib', u'bl', u'le', u'et', u'tt']"]
[15L 0.333333333 u"[u'ki', u'in', u'ns', u'sa', u'al', u'll', u'la']"]
[11L 0.363636364 u"[u'mc', u'cn', u'ne', u'ei', u'il']"]]
I am getting this error:
E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py:150: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Traceback (most recent call last):
File "C:werwer\wer\wer.py", line 32, in <module>
clf.fit(df_X, df_Y)
File "E:\Program Files Extra\Python27\lib\site-packages\sklearn\naive_bayes.py", line 163, in fit
self.theta_[i, :] = np.mean(Xi, axis=0)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\fromnumeric.py", line 2727, in mean
out=out, keepdims=keepdims)
File "E:\Program Files Extra\Python27\lib\site-packages\numpy\core\_methods.py", line 69, in _mean
ret, rcount, out=ret, casting='unsafe', subok=False)
TypeError: unsupported operand type(s) for /: 'unicode' and 'long'
My understanding is that I need to convert the features into one numpy array as a feature vector, but I don't know if I am preparing this X vector correctly, since it contains very different value types.
Related questions: Choosing a Classification Algorithm to Classify Mix of Nominal and Numeric Data -- Mixing Categorial and Continuous Data in Naive Bayes Classifier Using Scikit-learn
Okay so there are a few things going on. As DalekSec pointed out, it's best practice to keep all your features as one type as you input them into a model like GaussianNB. The traceback indicates that while fitting the model, it tries to divide a string (presumably one of your unicode strings like u"[u'ke', u'el', u'll', u'ly']") by an integer. So what we need to do is convert the training data into a form that sklearn can use. We can do this a few ways, two of which ogrisel eloquently describes in this answer here.
We can convert all the continuous variables to categorical variables. In our case, this means converting total_length (in some cases you could probably treat this as a categorical variable, but let's not get ahead of ourselves) and vowel-ratio. For instance, you can bin the values you see in each feature into one of 5 values based on percentile: 'very small', 'small', 'medium', 'high', 'very high'. There's no really easy way to do this in sklearn as far as I know, but it should be pretty straightforward to do it yourself. The only thing you would want to change is to use MultinomialNB instead of GaussianNB, because you'll be dealing with features that are better described by multinomial distributions than by Gaussian ones.
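A do-it-yourself version of the percentile binning described above could look like the sketch below. It uses only numpy; the helper name bin_by_percentile and the choice of interior cut points at the 20th, 40th, 60th and 80th percentiles are assumptions for illustration, and the sample values are the total_length column from the question's printout:

```python
import numpy as np

LABELS = ['very small', 'small', 'medium', 'high', 'very high']


def bin_by_percentile(values, n_bins=5):
    """Map each value to a bin code 0..n_bins-1 based on percentile cut points."""
    values = np.asarray(values, dtype=float)
    # Interior cut points only (for 5 bins: 20th, 40th, 60th, 80th percentiles).
    cuts = np.percentile(values, np.linspace(0, 100, n_bins + 1)[1:-1])
    return np.digitize(values, cuts)


total_length = [9, 17, 11, 11, 15, 13, 12, 15, 15, 11]
codes = bin_by_percentile(total_length)
names = [LABELS[c] for c in codes]
print(list(zip(total_length, names)))
```

The integer codes, rather than the label strings, are what a classifier would be given.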
We can convert the categorical features to numeric ones for use with GaussianNB. Personally I find this to be the more intuitive approach. Basically, when dealing with text, you need to figure out what information you want to take from the text and pass to the classifier. It looks to me like you want to extract the incidence of different two-letter last names.
Normally I would ask you whether or not you have all the last names in your dataset, but since each one is only two letters we can just store all the possible two-letter names (including the unicode characters involving accent marks) with a minimal impact on performance. This is where something like sklearn's CountVectorizer might be useful. Assuming that you have every possible combination of two-letter last names in your data, you can directly use this to turn a row in your twoLetters_lastName column into an N-dimensional vector that records the number of occurrences of each unique last name in your row. Then just combine this new vector with your other two features into a numpy array.
In case you do not have every possible combination of two letters (including accented ones), you should consider generating that list and passing it in as the 'vocabulary' for the CountVectorizer. This is so that your classifier knows how to handle all possible last names. It's not the end of the world if you don't handle all cases, but any new unseen two-letter pairs will be ignored in this scheme.
Before you use these tools, you should make sure that you pass your last name column in as a list, and not as a string, as this can result in unintended behavior.
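To make the last two points concrete, here is a hand-rolled sketch of the counting step. It is a plain numpy stand-in for what CountVectorizer would do, not sklearn's API. It assumes each cell of twoLetters_lastName is a stringified list, as in the printed df_X, and uses ast.literal_eval to turn it back into a real list. The helper names and the two sample rows are illustrative only:

```python
import ast
import numpy as np

# Two rows copied from the printed df_X: numeric features and stringified lists.
numeric = np.array([[9, 0.222222222], [17, 0.411764706]])
raw_cells = ["[u'ke', u'el', u'll', u'ly']",
             "[u'ma', u'ar', u'rg', u'ga', u'ar', u'ri', u'is']"]


def parse_pairs(cell):
    """Turn a stringified list such as "[u'ke', u'el']" into a real list."""
    return list(ast.literal_eval(cell))


def count_matrix(rows, vocabulary):
    """Count how often each vocabulary entry occurs in each row."""
    index = {pair: j for j, pair in enumerate(vocabulary)}
    counts = np.zeros((len(rows), len(vocabulary)))
    for i, pairs in enumerate(rows):
        for pair in pairs:
            j = index.get(pair)
            if j is not None:  # pairs outside the vocabulary are ignored
                counts[i, j] += 1
    return counts


parsed = [parse_pairs(cell) for cell in raw_cells]
vocabulary = sorted({pair for row in parsed for pair in row})

# Numeric columns first, then one count column per two-letter pair.
X = np.hstack([numeric, count_matrix(parsed, vocabulary)])
print(X.shape)
```

Passing a fixed, pre-generated vocabulary instead of one derived from the data gives every row the same columns, including for pairs that only show up later.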
You can read more about general sklearn preprocessing here, and more about CountVectorizer and other text feature extraction tools provided by sklearn here. I use a lot of these tools daily, and recommend them for basic text extraction tasks. There are also plenty of tutorials and demos available online. You might also look into other methods of representation, like binarizing and one-hot encoding. There are many ways to solve this; which one fits best depends mostly on your specific problem and needs.
After you're able to turn all your data into one form or the other, you should be able to make use of either the Gaussian or Multinomial NB classifier. As for your error regarding the 1D vector, you printed df_Y and it looked like
[[u'non-A']
[u'A']
[u'non-A']
...,
[u'A']
[u'non-A']
[u'non-A']]
Basically, it's expecting this to be a flat list, rather than a column vector (a list of one-element lists). Just reshape it accordingly by making use of commands like numpy.reshape() or numpy.ravel() (numpy.ravel() would probably be more appropriate, considering that you're dealing with just one column, as the warning mentioned).
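As a small illustration of the reshaping, the sketch below uses a three-row stand-in for df_Y (the real array comes from the DataFrame in the question):

```python
import numpy as np

# Column vector shaped (n_samples, 1), like the printed df_Y.
df_Y = np.array([['non-A'], ['A'], ['non-A']])

# Flatten to shape (n_samples,), which is what fit() expects for y.
y = df_Y.ravel()

print(df_Y.shape, y.shape)
```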
I'm not 100% sure, but I think sklearn.naive_bayes requires a purely numeric feature vector instead of a mixture of text and numbers. It looks like it crashes when trying to "divide" a unicode string by a long integer.
I can't be much help with finding numeric representations for text, but this scikit-learn tutorial might be a good start.
I am trying to round up a number using math module in python.
So when I do,
print math.ceil(21/10)
I get '2.0' which is right.
print math.ceil(27/10)
I still get '2.0'
I want to get 3, since it is closest to 2.7
Could someone please advise a workaround.
Thanks in advance.
You are being surprised by the division operator in Python 2.x. With integers, it does integer division; 21/10 results in 2 and 27/10 results in 2.
Use 21.0/10 and 27.0/10 and you will get the correct answers.
In Python 3.x, the / operator always performs true division and returns a float, even when both operands are integers. You can get this behavior in Python 2.7 by using from __future__ import division.
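A short sketch of the two division operators, assuming Python 3 (or Python 2.7 with from __future__ import division at the top of the file). The last line shows an integer-only way to take a ceiling, which avoids floats altogether:

```python
import math

true_quotient = 27 / 10    # true division
floor_quotient = 27 // 10  # floor division, what Python 2 does for int / int

rounded_up = math.ceil(27 / 10)  # ceiling of the true quotient
int_ceiling = -(-27 // 10)       # same result using only integer arithmetic

print(true_quotient, floor_quotient, rounded_up, int_ceiling)
```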
By the way, pretty sure the integer ceiling of 21/10 should be 3.
I think you want round:
from __future__ import division
print round(27/10)
3.0
print round(21/10)
2.0
math.ceil will always round up, while round will round to the nearest integer.
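The difference is easiest to see side by side. This sketch feeds the two float quotients from the question to both functions:

```python
import math

# For each value, store (ceiling, nearest) so the two can be compared.
results = {value: (math.ceil(value), round(value)) for value in (2.1, 2.7)}

for value in sorted(results):
    print(value, results[value])
```

For 2.1 the two functions disagree, while for 2.7 they agree, which is why the choice depends on whether "round up" or "round to nearest" is wanted.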
You only get 2 from math.ceil(21/10) because of how python2 handles integer division.
21/10 in python2 is 2
The problem is that Python treats 27 and 10 as integers, so it evaluates 27/10 as 2. If you write 27/10.0, the division is done in floating point and it will work as you want.