Convert pandas series into numpy array [duplicate] - python-2.7

This question already has answers here:
How do I convert a Pandas series or index to a NumPy array? [duplicate]
(8 answers)
Closed 4 years ago.
I am new to pandas and Python. My input data looks like this:
category text
1 hello iam fine. how are you
1 iam good. how are you doing.
inputData = pd.read_csv('Input', sep='\t', names=['category', 'text'])
X = inputData["text"]
Y = inputData["category"]
Here Y is a pandas Series object, which I want to convert into a NumPy array, so I tried .as_matrix:
YArray= Y.as_matrix(columns=None)
print YArray
But I got the output [1,1], which is wrong, since I have only one column (category) and two rows. I want the result as a 2x1 matrix.

To get a NumPy array, you need
Y.values

Try this:
After applying .as_matrix to your Series object, reshape the result:
YArray.reshape((2,1))
This is needed because .as_matrix() returns a 1-D NumPy array, not a NumPy matrix.

If df is your DataFrame, then a column of the DataFrame is a Series. To convert it into an array, use .values:
df = pd.DataFrame()
x = df.values
print(type(x))
This prints
<class 'numpy.ndarray'>
which shows that it was successfully converted to an array.
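The answers above can be combined into one short sketch. The DataFrame here is a hypothetical stand-in for the tab-separated file in the question, and reshape(-1, 1) is used instead of a hard-coded (2,1) so the row count is inferred. Treat it as an illustration under those assumptions rather than the only way to do this.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tab-separated input file in the question.
input_data = pd.DataFrame({
    "category": [1, 1],
    "text": ["hello iam fine. how are you", "iam good. how are you doing."],
})

Y = input_data["category"]

y_flat = Y.values                 # 1-D array, shape (2,)
y_column = y_flat.reshape(-1, 1)  # 2-D column, shape (2, 1)

print(type(y_flat), y_flat.shape)
print(y_column.shape)
```

The 1-D result printed as [1 1] is not wrong as such; it simply has no second axis until it is reshaped.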

Related

Pandas re-arranges DataFrame Columns? [duplicate]

This question already has answers here:
Pandas dict to dataframe - columns out of order?
(2 answers)
Closed 6 years ago.
Not asking how to re-arrange per se; rather noting that pandas is changing the order given:
Quite a surprise: it alphabetizes! Can you reject that and enforce your own order?
(python 2.7.11/anaconda/pandas 0.18.0/os 10.9.4)
I think it is not possible, because your input is a dictionary and the items in a dictionary are not ordered. So I would simply pass the column order explicitly:
my_order = ["dist", "dem"]
df1 = pandas.DataFrame({"dist":xm, "dem":ym}, columns=my_order)
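A runnable version of the same idea might look like the sketch below. The values assigned to xm and ym are made up for illustration, since the original post does not show them.

```python
import pandas as pd

# Hypothetical sample values; xm and ym are not defined in the original post.
xm = [1.0, 2.0, 3.0]
ym = [10.0, 20.0, 30.0]

my_order = ["dist", "dem"]

# The columns argument fixes the column order regardless of dict ordering.
df1 = pd.DataFrame({"dist": xm, "dem": ym}, columns=my_order)

print(list(df1.columns))
```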

How to use the dimension of a python matrix in a loop

I am working with a matrix, let's call it X, in Python.
I know how to get the dimensions of the matrix using X.shape, but I am specifically interested in using the number of rows of the matrix in a for loop, and I don't know how to get this value in a datatype suitable for a loop.
For example, imagine this simple situation:
a = np.matrix([[1,2,3],[4,5,6]])
for i in 1:(number of rows of a)
print i
How can I get automatically that "number of rows of a"?
X.shape[0] == number of rows in X
A quick search of the NumPy documentation will lead you to shape, which returns a tuple of array dimensions.
In your case, the first dimension (axis 0) is the number of rows and the second (axis 1) is the number of columns. You can access them as you would access any tuple element:
import numpy as np
a = np.matrix([[1,2,3],[4,5,6]])
# a.shape[1]: number of columns
for i in range(0, a.shape[1]):
    print('column ' + format(i))
# a.shape[0]: number of rows
for i in range(0, a.shape[0]):
    print('row ' + format(i))
This will print:
column 0
column 1
column 2
row 0
row 1

converting python pandas column to numpy array in place

I have a csv file in which one of the columns is a semicolon-delimited list of floating point numbers of variable length. For example:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
When I read this into a pandas DataFrame, the datatype for that column is object. I want to convert it, ideally in place, to a numpy array (or just a regular float array, it doesn't matter too much at this stage).
I wrote a little function which takes a single one of those list elements and converts it to a numpy array:
def parse_list(data):
    # Split on ';' and convert each piece to a float.
    data_list = data.split(';')
    return np.array([float(v) for v in data_list])
This works fine, but what I want to do is do this conversion directly in the DataFrame so that I can use pandasql and the like to manipulate the whole data set after the conversion. Can someone point me in the right direction?
EDIT: I seem to have asked the question poorly. I would like to convert the following data frame:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
where the dtype of List is 'object'
to the following dataframe:
Index List
0 [900.0, 300.0, 899.2]
1 [123.4, 887.3, 900.1, 985.3]
where the datatype of List is numpy array of floats
EDIT2: some progress, thanks to the first answer. I now have the line:
df['List'] = df['List'].str.split(';')
which splits the column in place into lists of strings, but the dtype remains object. When I then try to do
df['List'] = df['List'].astype(float)
I get the error:
return arr.astype(dtype)
ValueError: setting an array element with a sequence.
If I understand you correctly, you want to transform your data from pandas to numpy arrays.
I used this:
pandas_DataName.as_matrix(columns=None)
And it worked for me.
I hope this could help you.
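The astype(float) call in the question fails because each cell holds a whole list, which cannot be cast to a single float. One possible approach, sketched here with a small hypothetical DataFrame in place of the CSV file, is to apply the question's parse_list function to every cell of the column.

```python
import numpy as np
import pandas as pd

def parse_list(data):
    """Turn a string such as '900.0;300.0' into a 1-D float array."""
    return np.array([float(v) for v in data.split(";")])

# Hypothetical stand-in for the CSV file described in the question.
df = pd.DataFrame({"List": ["900.0;300.0;899.2", "123.4;887.3;900.1;985.3"]})

# Each cell becomes its own float array; the column dtype stays object
# because the arrays have different lengths.
df["List"] = df["List"].apply(parse_list)

print(df["List"].iloc[0].dtype)
print(df["List"].dtype)
```

The column itself still reports dtype object, since pandas stores variable-length arrays as generic Python objects; only the individual cells are float arrays.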

Adding data to a Pandas dataframe

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing States. I came across the pyzipcode module which can take as an input a zip code and returns the state as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
There is no need to iterate. If the lookup data is in the form of a dict, you should be able to perform the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work because it can't generate a Series to align with your df, you can apply row-wise by passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda x: zcdb[x['Physician_Profile_Zip_Code']].state, axis=1)
By using double square brackets we return a df, allowing you to pass the axis param. Each row then arrives in the lambda as a Series, so the zip code is read from it by column name.
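The snippets above overwrite every state, including the ones already filled in. If only the missing values should be replaced, something like the following sketch may help. It uses a plain dict named zip_to_state as a hypothetical stand-in for the pyzipcode database, and made-up sample rows, so it runs without that package.

```python
import pandas as pd

# Hypothetical lookup table standing in for pyzipcode's ZipCodeDatabase.
zip_to_state = {54115: "WI", 10001: "NY"}

# Made-up sample rows; two of them have no state recorded.
df = pd.DataFrame({
    "Physician_Profile_Zip_Code": [54115, 10001, 54115],
    "Physician_Profile_State": ["WI", None, None],
})

# Look up a state for every row, then use it only where the state is missing.
looked_up = df["Physician_Profile_Zip_Code"].map(zip_to_state)
df["Physician_Profile_State"] = df["Physician_Profile_State"].fillna(looked_up)

print(df)
```

Because fillna aligns on the index, existing states are left untouched and only the gaps receive the looked-up value.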

Theano get unique values in a tensor

I have a tensor which I convert into a vector by flattening, now I want to remove the duplicate values in this vector. How can I do this? What is equivalent for numpy.unique() in theano?
x1 = T.itensor3('x1')
y1 = T.flatten(x1)
#z1 = T.unique() How do I do this?
For e.g. my tensor may be : [1,1,2,3,3,4,4,5,1,3,4]
and I want : [1,2,3,4,5]
EDIT: this is now available in Theano: http://deeplearning.net/software/theano/library/tensor/extra_ops.html#theano.tensor.extra_ops.Unique
This question was also asked on the theano-users mailing list. The conclusion was that this is one of the NumPy functions that isn't wrapped in Theano. As the asker doesn't need the gradient, it can be wrapped quickly. Here is an example that expects the output type to be the same as the input type.
import numpy
import theano
from theano.compile.ops import as_op

# Wrap numpy.unique as a Theano op (no gradient is defined for it).
@as_op(itypes=[theano.tensor.imatrix],
       otypes=[theano.tensor.imatrix])
def numpy_unique(a):
    return numpy.unique(a)
More documentation about as_op is available here: http://deeplearning.net/software/theano/tutorial/extending_theano.html#as-op-example
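For comparison, this is the plain NumPy behaviour that the question wants to reproduce, using the example values from the question. It is only a reference sketch of numpy.unique and does not involve Theano.

```python
import numpy as np

# Example values taken from the question.
values = np.array([1, 1, 2, 3, 3, 4, 4, 5, 1, 3, 4])

# numpy.unique returns the sorted distinct values as a 1-D array.
unique_values = np.unique(values.flatten())

print(unique_values)
```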