Dataframe to dictionary: python - python-2.7

So, I have a file F1.txt:
CDUS,CBSCS,CTRS,CTRS_ID
0,0,0.000000000375,056572
0,0,4.0746,0309044
0,0,0.6182,0971094
0,0,15.4834,075614
I want to build a dictionary with the column names as keys and the corresponding dtype of each column as the value.
My read statement has to be like this:
csv = pandas.read_csv('F1.txt', dtype={'CTRS_ID': str})
I'm expecting something like this:
data = {'CDUS': 'int64', 'CBSCS': 'int64', 'CTRS': 'float64', 'CTRS_ID': 'str'}
Can someone help me with this? Thanks in advance.

You can use dtypes to find the type of each column and then transform the result to a dictionary with to_dict. Also, if you want a string representation of the type, you can convert the dtypes output to string:
csv = pandas.read_csv('F1.txt', dtype={'CTRS_ID': str})
csv.dtypes.astype(str).to_dict()
Which gives the output:
{'CBSCS': 'int64', 'CDUS': 'int64', 'CTRS': 'float64', 'CTRS_ID': 'object'}
This is actually the right result, since pandas stores strings under the object dtype.
I don't have enough expertise to elaborate on this, but here are a couple of references:
pandas distinction between str and object types
pandas string data types
"pandas doesn't support the internal string types (in fact they are always converted to object)" [from pandas maintainer @Jeff]

Related

Search pandas column and return all elements (rows) that contain any (one or more) non-digit character

Seems pretty straightforward. The column contains numbers in general, but for some reason some of them include non-digit characters. I want to find all of them. I am using this code:
df_other_values.total_count.str.contains('[^0-9]')
but I get the following error:
AttributeError: Can only use .str accessor with string values, which use
np.object_ dtype in pandas
So I tried this:
df_other_values = df_other.total_countvalues
df_other_values.total_count.str.contains('[^0-9]')
but get the following error:
AttributeError: 'DataFrame' object has no attribute 'total_countvalues'
So instead of going down the rabbit hole further, I was thinking there must be a way to do this without having to change my dataframe into a np.object. Please advise.
Thanks.
I believe you need to cast to strings first with astype and then filter by boolean indexing:
df1 = df_other_values[df_other_values.total_count.astype(str).str.contains('[^0-9]')]
Alternative solution with isnumeric:
df1 = df_other_values[~df_other_values.total_count.astype(str).str.isnumeric()]
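A minimal self-contained sketch of the same idea, using made-up data:
import pandas as pd

# Hypothetical data: total_count is mostly digits, with a few stray characters
df_other_values = pd.DataFrame({'total_count': ['123', '456', '78a', 'x90', '1011']})

# Cast to str so the .str accessor works even on mixed dtypes,
# then keep only rows containing at least one non-digit character
bad_rows = df_other_values[df_other_values.total_count.astype(str).str.contains('[^0-9]')]
print(bad_rows)  # the rows holding '78a' and 'x90'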

Formatting thousand separator for numbers in a pandas dataframe

I am trying to write a dataframe to a csv and I would like the numbers to be formatted with thousands separators (commas). I don't see any way in the to_csv docs to use a format or anything like this.
Does anyone know a good way to be able to format my output?
My csv output looks like this:
12172083.89 1341.4078 -9568703.592 10323.7222
21661725.86 -1770.2725 12669066.38 14669.7118
I would like it to look like this:
12,172,083.89 1,341.4078 -9,568,703.592 10,323.7222
21,661,725.86 -1,770.2725 12,669,066.38 14,669.7118
Comma is the default field separator. If you want to choose your own separator, you can declare the sep parameter of pandas' to_csv() method.
df.to_csv(sep=',')
If your goal is to add thousands separators and export the result back to a csv, you can follow this example. Note that the formatted numbers become strings, so a tab separator avoids clashing with the commas inside them:
import pandas as pd

df = pd.DataFrame([[12172083.89, 1341.4078, -9568703.592, 10323.7222],
                   [21661725.86, -1770.2725, 12669066.38, 14669.7118]],
                  columns=['A', 'B', 'C', 'D'])

# Format every column with a thousands separator (the values become strings)
for c in df.columns:
    df[c] = df[c].apply(lambda x: '{0:,}'.format(x))

df.to_csv(sep='\t')
If you just want pandas to show separators when printed out:
pd.options.display.float_format = '{:,}'.format
print(df)
What you're looking to do has nothing to do with csv output but rather is related to the following:
print('{0:,}'.format(123456789000000.546776362))
produces
123,456,789,000,000.55
(the trailing digits change because the literal is rounded to the nearest representable float)
See format string syntax.
Also, you'd do well to pay heed to @Peter's comment above about compromising the structure of a csv in the first place.

Randomly set one-third of na's in a column to one value and the rest to another value

I'm trying to impute missing values in a dataframe df. I have a column A with 300 NaNs. I want to randomly set two-thirds of them to value1 and the rest to value2.
Please help.
EDIT: I'm actually trying to do this on dask, which does not support item assignment. This is what I have currently. Initially, I thought I'd try to convert all NAs to value1:
da.where(df.A.isnull() == True, 'value1', df.A)
I got the following error:
ValueError: need more than 0 values to unpack
As the comment suggests, you can solve this with Series.where.
The following will work, but I cannot promise how efficient it is. (I suspect it may be better to produce a whole column of replacements at once with numpy.random.choice.)
import random

df['A'] = df['A'].where(~df['A'].isnull(),
                        lambda s: s.map(
                            lambda x: random.choice(['value1', 'value1', x])))
Explanation: where the value is not null (NaN), keep the original. Where it is null, replace it with the corresponding value of the series produced by the lambda, which randomly keeps the original value for roughly 1/3 of the entries and substitutes 'value1' for the others.
Note that, depending on your data, this likely has changed the data type of the column.
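For reference, here is a hedged sketch of the numpy alternative mentioned above (plain pandas, since dask does not support this kind of item assignment). It fills exactly two-thirds of the NaNs with 'value1' and the rest with 'value2':
import numpy as np

# Boolean mask of the missing entries in column A
mask = df['A'].isnull()
n = mask.sum()

# Build exactly 2/3 'value1' and 1/3 'value2', then shuffle the order
fill = np.array(['value1'] * (2 * n // 3) + ['value2'] * (n - 2 * n // 3))
np.random.shuffle(fill)

# Assign the shuffled replacements to the missing positions
df.loc[mask, 'A'] = fill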

converting python pandas column to numpy array in place

I have a csv file in which one of the columns is a semicolon-delimited list of floating point numbers of variable length. For example:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
when I read this into a pandas DataFrame, the datatype for that column is object. I want to convert it, ideally in place, to a numpy array (or just a regular float array, it doesn't matter too much at this stage).
I wrote a little function which takes a single one of those list elements and converts it to a numpy array:
def parse_list(data):
    data_list = data.split(';')
    # Note: in Python 3 wrap the map in list(), i.e. np.array(list(map(float, data_list)))
    return np.array(map(float, data_list))
This works fine, but what I want to do is do this conversion directly in the DataFrame so that I can use pandasql and the like to manipulate the whole data set after the conversion. Can someone point me in the right direction?
EDIT: I seem to have asked the question poorly. I would like to convert the following data frame:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
where the dtype of List is 'object'
to the following dataframe:
Index List
0 [900.0, 300.0, 899.2]
1 [123.4, 887.3, 900.1, 985.3]
where the datatype of List is numpy array of floats
EDIT2: some progress, thanks to the first answer. I now have the line:
df['List'] = df['List'].str.split(';')
which splits the column in place into lists, but the dtype remains object. When I then try to do
df['List'] = df['List'].astype(float)
I get the error:
return arr.astype(dtype)
ValueError: setting an array element with a sequence.
If I understand you correctly, you want to transform your data from pandas to numpy arrays.
I used this:
pandas_DataName.as_matrix(columns=None)
And it worked for me. (Note that as_matrix has since been deprecated in favor of df.values / df.to_numpy().)
For more information, see the pandas documentation.
I hope this helps.
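Since the EDIT2 error comes from calling astype(float) on cells that hold lists, a hedged sketch of the per-cell conversion (using made-up data matching the question) might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'List': ['900.0;300.0;899.2', '123.4;887.3;900.1;985.3']})

# Split each cell on ';' and convert the pieces to a float numpy array.
# The column dtype stays 'object', because pandas stores the arrays as Python objects.
df['List'] = df['List'].apply(lambda s: np.array([float(x) for x in s.split(';')]))

print(df['List'][0])  # [900.  300.  899.2]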

Converting a string of numbers to hex and back to dec pandas python

I currently have a string of values which I retrieved after filtering through data from a csv file. Ultimately I had to do some filtering of the data, but I have the same numbers available as a list, dataframe, or array. For each element, I need to convert the number to hex, then convert the first 8 digits of the hex back to decimal, and likewise convert the last 8 digits of the same hex back to decimal.
I cannot provide a snippet because it is sensitive data, but here is an example.
I basically have something like this
>>> list_A
[52894036, 78893201, 45790373]
If I convert it to a dataframe and call df.dtypes, it says dtype: object and I can convert the values of Column A to bool, int, or string, but the dtype is always an object.
It does not matter whether it is a function, or just a simple loop. I have been trying many methods and am unable to attain the results I need. But ultimately the data is taken from different csv files and will never be the same values or list size.
Pandas is designed to work primarily with integers and floats, with no particular facilities for hexadecimal that I know of, but you can use apply to access standard python conversion functions like hex and int:
df = pd.DataFrame({'a': [52894036999, 78893201999, 45790373999]})
df['b'] = df['a'].apply(hex)
df['c'] = df['b'].apply(int, base=0)
Results:
             a             b            c
0  52894036999   0xc50baf407  52894036999
1  78893201999  0x125e66ba4f  78893201999
2  45790373999   0xaa951a86f  45790373999
Note that this answer is for Python 3. For Python 2 you may need to strip off the trailing "L" in column "b" with df['b'].str[:-1].
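As a hedged sketch of the "first 8 / last 8 hex digits" step the question asks for (assuming each hex string has at least 8 digits after the '0x' prefix; the column names first8/last8 are made up for illustration):
# Strip the '0x' prefix, then slice each end and convert back to decimal
digits = df['b'].str[2:]
df['first8'] = digits.str[:8].apply(int, base=16)
df['last8'] = digits.str[-8:].apply(int, base=16)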