converting python pandas column to numpy array in place - python-2.7

I have a csv file in which one of the columns is a semicolon-delimited list of floating point numbers of variable length. For example:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
When I read this into a pandas DataFrame, the datatype for that column is object. I want to convert it, ideally in place, to a numpy array (or just a regular float array; it doesn't matter too much at this stage).
I wrote a little function which takes a single one of those list elements and converts it to a numpy array:
import numpy as np

def parse_list(data):
    data_list = data.split(';')
    # Python 2.7: map returns a list; on Python 3 use np.array(data_list, dtype=float)
    return np.array(map(float, data_list))
This works fine, but what I want to do is do this conversion directly in the DataFrame so that I can use pandasql and the like to manipulate the whole data set after the conversion. Can someone point me in the right direction?
EDIT: I seem to have asked the question poorly. I would like to convert the following data frame:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
where the dtype of List is 'object'
to the following dataframe:
Index List
0 [900.0, 300.0, 899.2]
1 [123.4, 887.3, 900.1, 985.3]
where each element of List is a numpy array of floats.
EDIT2: some progress, thanks to the first answer. I now have the line:
df['List'] = df['List'].str.split(';')
which splits the column in place into lists of strings, but the dtype remains object. When I then try to do
df['List'] = df['List'].astype(float)
I get the error:
return arr.astype(dtype)
ValueError: setting an array element with a sequence.

If I understand you correctly, you want to transform your data from pandas to numpy arrays.
I used this:
pandas_DataName.as_matrix(columns=None)
And it worked for me.
I hope this helps.
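Note that as_matrix was deprecated in later pandas releases in favor of .values and .to_numpy(). For the conversion the question actually asks for (each cell becoming a float array), here is a minimal sketch, assuming the column is named List as in the example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'List': ['900.0;300.0;899.2',
                            '123.4;887.3;900.1;985.3']})

# Split each semicolon-delimited string and build a float array per row.
df['List'] = df['List'].apply(lambda s: np.array(s.split(';'), dtype=float))
The column will still report dtype object, because a column of variable-length arrays cannot have a numeric dtype; object here just means each cell holds a Python object (now an ndarray). That is also why df['List'].astype(float) fails with "setting an array element with a sequence".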

Related

Search pandas column and return all elements (rows) that contain any (one or more) non-digit character

Seems pretty straightforward. The column contains numbers in general, but for some reason some of them have non-digit characters. I want to find all of them. I am using this code:
df_other_values.total_count.str.contains('[^0-9]')
but I get the following error:
AttributeError: Can only use .str accessor with string values, which use
np.object_ dtype in pandas
So I tried this:
df_other_values = df_other.total_countvalues
df_other_values.total_count.str.contains('[^0-9]')
but get the following error:
AttributeError: 'DataFrame' object has no attribute 'total_countvalues'
So instead of going down the rabbit hole further, I was thinking there must be a way to do this without having to change my dataframe into a np.object. Please advise.
Thanks.
I believe you need to cast to strings first with astype and then filter by boolean indexing:
df1 = df[df_other_values.total_count.astype(str).str.contains('[^0-9]')]
Alternative solution with isnumeric:
df1 = df[~df_other_values.total_count.astype(str).str.isnumeric()]
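A minimal worked sketch of both approaches, assuming a small frame whose total_count column contains the hypothetical stray value '12a':
import pandas as pd

df = pd.DataFrame({'total_count': [123, 456, '12a']})

# Cast to str so the .str accessor works on the mixed object column,
# then keep rows containing any non-digit character.
bad = df[df['total_count'].astype(str).str.contains('[^0-9]')]

# Equivalent via isnumeric: keep rows that are NOT purely numeric.
bad2 = df[~df['total_count'].astype(str).str.isnumeric()]
Both select the row holding '12a'.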

Randomly set one-third of na's in a column to one value and the rest to another value

I'm trying to impute missing values in a dataframe df. I have a column A with 300 NaNs. I want to randomly set 2/3 of them to value1 and the rest to value2.
Please help.
EDIT: I'm actually trying to do this on dask, which does not support item assignment. This is what I have currently. Initially, I thought I'd try to convert all NAs to value1:
da.where(df.A.isnull() == True, 'value1', df.A)
I got the following error:
ValueError: need more than 0 values to unpack
As the comment suggests, you can solve this with Series.where.
The following will work, but I cannot promise how efficient it is. (I suspect it may be better to produce a whole column of replacements at once with numpy.random.choice.)
import random

df['A'] = df['A'].where(~df['A'].isnull(),
                        lambda s: s.map(
                            lambda x: random.choice(['value1', 'value1', x])))
Explanation: where the value is not null (not NaN), we certainly keep the original. Where it is null, it is replaced with the corresponding value of the series produced by the lambda, which maps each value to a random choice: the original value with probability 1/3, and 'value1' otherwise.
Note that, depending on your data, this likely has changed the data type of the column.
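In plain pandas (not dask, which lacks item assignment), the vectorized alternative hinted at above might look like this sketch, using the 2/3-value1, 1/3-value2 split from the question:
import numpy as np

mask = df['A'].isnull()
# Draw all replacements in one call instead of one random.choice per row.
df.loc[mask, 'A'] = np.random.choice(['value1', 'value2'],
                                     size=mask.sum(),
                                     p=[2/3., 1/3.])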

Datetime object through 'datetime.strptime' is not iterable

I have a csv file containing years of data, and I need to calculate the difference between the max date and the min date, but I am facing a real problem in determining the max value of the dates.
So, I am doing this to convert my dates into datetime objects:
Temps = datetime.strptime(W['datum'][i]+' '+W['timestamp'][i],'%Y-%m-%d %H:%M:%S')
Printing this line gives me the exact result I want, but when I try to extract the max value of these dates using this line of code:
start = max(Temps)
I get this error: 'datetime.strptime' object is not iterable
Where am I mistaken?
The expression
datetime.strptime(W['datum'][i]+' '+W['timestamp'][i],'%Y-%m-%d %H:%M:%S')
produces a single value (a scalar). When you assign it to Temps, this variable becomes a scalar, not a list. It contains only one value.
Then when you try to evaluate max(Temps) max is expecting to find something with multiple values as its argument but, unfortunately, it finds what Temps was assigned most recently.
This was a single value, which is not 'iterable'.
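A minimal sketch of the fix, assuming W is a dict-like object with equal-length 'datum' and 'timestamp' sequences: collect every parsed datetime in a list, then take the max and min of that list:
from datetime import datetime

# Parse each row into a datetime and keep them all.
temps = [datetime.strptime(W['datum'][i] + ' ' + W['timestamp'][i],
                           '%Y-%m-%d %H:%M:%S')
         for i in range(len(W['datum']))]

start = max(temps)              # latest date
span = max(temps) - min(temps)  # the difference, as a timedelta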

dataframe to dictionary: python

So, I have a file
F1.txt
CDUS,CBSCS,CTRS,CTRS_ID
0,0,0.000000000375,056572
0,0,4.0746,0309044
0,0,0.6182,0971094
0,0,15.4834,075614
I want to insert the column names and its dtype into a dictionary with the column names being the key and the corresponding dtype of the column being the value.
My read statement has to be like this:
csv=pandas.read_csv('F2.txt',dtype={'CTRS_ID':str})
I'm expecting something like this:
data = {'CDUS':'int64','CBSCS':'int64','CTRS':'float64','CTRS_ID':'str'}
Can someone help me with this? Thanks in advance.
You can use dtypes to find the type of each column and then transform the result to a dictionary with to_dict. Also, if you want a string representation of the type, you can convert the dtypes output to string:
csv=pandas.read_csv('F2.txt',dtype={'CTRS_ID':str})
csv.dtypes.astype(str).to_dict()
Which gives the output:
{'CBSCS': 'int64', 'CDUS': 'int64', 'CTRS': 'float64', 'CTRS_ID': 'object'}
This is actually the right result, since pandas treats string as object.
I don't have enough expertise to elaborate on this, but here are a couple of references:
pandas distinction between str and object types
pandas string data types
"pandas doesn't support the internal string types (in fact they are always converted to object)" [from pandas maintainer @Jeff]

Convert pandas series into numpy array [duplicate]

This question already has answers here:
How do I convert a Pandas series or index to a NumPy array? [duplicate]
I am new to pandas and python. My input data is like
category text
1 hello iam fine. how are you
1 iam good. how are you doing.
inputData = pd.read_csv('Input', sep='\t', names=['category', 'text'])
X = inputData["text"]
Y = inputData["category"]
Here Y is the pandas Series object, which I want to convert into a numpy array, so I tried .as_matrix:
YArray= Y.as_matrix(columns=None)
print YArray
But I got the output as [1,1], which is wrong since I have only one column (category) and two rows. I want the result as a 2x1 matrix.
To get numpy array, you need
Y.values
Try this after applying .as_matrix() on your series object:
YArray.reshape((2, 1))
since .as_matrix() only returns a numpy array, not a numpy matrix.
If df is your dataframe, then a column of the dataframe is a Series, and to convert it (or the whole frame) into an array:
df = pd.DataFrame()
x = df.values
print(type(x))
This prints
<class 'numpy.ndarray'>
successfully converting it to an array.
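As a closing note, .as_matrix() was deprecated and later removed from pandas; on current versions the equivalent, sketched here for the 2x1 shape the question asks for, is .to_numpy() (or .values) plus a reshape:
import pandas as pd

Y = pd.Series([1, 1], name='category')

# to_numpy() replaces the removed as_matrix(); reshape gives the 2x1 shape.
YArray = Y.to_numpy().reshape((2, 1))
print(YArray)   # [[1]
                #  [1]]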