Converting a pandas DataFrame column containing negative strings into float - python-2.7

I have a pandas DataFrame df that contains negative numbers stored as strings, and I would like to convert them to float:
NY_resitor1 NY_resitor2 SF_type SF_resitor2
45 "-36" Resis 40
47 "36" curr 34
. . . .
49 "39" curr 39
45 "-11" curr 12
12 "-200" Resis 45
This is the code I wrote:
df["NY_resitor2 "]=df["NY_resitor2 "].astype(float)
but I get the error:
ValueError: could not convert string to float: "-32"
What is the problem?

I think this might be a case of having an unusual Unicode variant of "-" somewhere in your string data. For example, this should work:
>>> import pandas as pd
>>> ser = pd.Series(['-36', '36'])
>>> ser.astype(float)
0 -36
1 36
dtype: float64
But this doesn't, because I've replaced the standard minus sign with a U+2212 minus sign:
>>> ser2 = pd.Series(['−32', '36'])
>>> ser2.astype(float)
...
ValueError: could not convert string to float: '−32'
You could address this by replacing the offending characters, using str.replace():
>>> ser2.str.replace('−', '-').astype(float)
0 -32
1 36
dtype: float64
If that's not the issue, then I don't know what is!
Edit: another possibility is that your strings contain literal quote characters, e.g.:
>>> ser3 = pd.Series(['"-36"', '"36"'])
>>> ser3.astype(float)
...
ValueError: could not convert string to float: '"-36"'
In this case, you need to strip these out first:
>>> ser3.str.replace('"', '').astype(float)
0 -36
1 36
dtype: float64
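Both cases above come down to characters that float() does not accept. As a minimal sketch of how to spot and remove them with only the standard library (Python 3 syntax; the helper names describe_odd_chars and clean_number are made up for illustration and are not part of pandas):

```python
import unicodedata


def describe_odd_chars(text):
    """Return (char, Unicode name) for each character outside a plain float literal."""
    allowed = set("0123456789+-.eE")
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text if ch not in allowed]


def clean_number(text):
    """Replace U+2212 with an ASCII hyphen-minus, drop double quotes, then convert."""
    return float(text.replace("\u2212", "-").replace('"', "").strip())


print(describe_odd_chars("\u221232"))  # reports the U+2212 character by name
print(describe_odd_chars('"-36"'))     # reports the two quote characters
print(clean_number("\u221232"), clean_number('"-36"'))
```

On a Series, something like ser.map(clean_number) should apply the same cleanup element by element.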

Related

Removing part of a value in a certain column in a dataframe, and returning a DF

I have the following DataFrame named mydf:
A B
0 3de (1ABS) Adiran
1 3SA (SDAS) Adel
2 7A (ASA) Ronni
3 820 (SAAa) Emili
I want to remove the " (xxxx)" part and keep the rest of the values in column A, so that the DataFrame (mydf) looks like this:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
I have tried:
print mydf['A'].apply(lambda x: re.sub(r" \(.+\)", "", x) )
but then I get a Series object back and not a DataFrame object.
I have also tried to use replace:
df.replace([' \(.*\)'],[""], regex=True)
but it didn't change anything.
What am I doing wrong?
Thank you!

You can use the str.split() method:
In [3]: df.A = df.A.str.split('\s+\(').str[0]
In [4]: df
Out[4]:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
Or use the str.extract() method:
In [9]: df.A = df.A.str.extract(r'([^\(\s]*)', expand=False)
In [10]: df
Out[10]:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
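As for the two attempts in the question: both apply and replace return new objects rather than modifying the frame in place, so the result has to be assigned back or kept. A minimal sketch, assuming a reasonably recent pandas (the regex=True keyword of Series.str.replace is not available in very old releases) and Python 3 syntax; the variable names are illustrative:

```python
import pandas as pd

original = pd.DataFrame({
    "A": ["3de (1ABS)", "3SA (SDAS)", "7A (ASA)", "820 (SAAa)"],
    "B": ["Adiran", "Adel", "Ronni", "Emili"],
})

# Option 1: clean one column and assign it back; the object stays a DataFrame.
by_column = original.copy()
by_column["A"] = by_column["A"].str.replace(r"\s*\(.*\)", "", regex=True)

# Option 2: DataFrame.replace returns a new frame, so keep the return value.
by_frame = original.replace(r"\s*\(.*\)", "", regex=True)

print(by_column)
print(by_frame)
```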

Not calculating sum for all columns in pandas dataframe

I'm pulling data from Impala using impyla and converting it to a DataFrame using as_pandas. I'm using pandas 0.18.0 and Python 2.7.9.
I'm trying to calculate the sum of every column in a DataFrame and select the columns whose sum is greater than a threshold:
self.data = self.data.loc[:,self.data.sum(axis=0) > 15]
But when I run this, I get the following error:
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
Then I tried the following:
print 'length : ',len(self.data.sum(axis = 0)),' all columns : ',len(self.data.columns)
This prints two different lengths:
length : 78 all columns : 83
I also get the warning below:
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
To achieve my goal, I then tried another way:
for column in self.data.columns:
    sum = self.data[column].sum()
    if( sum < 15 ):
        self.data = self.data.drop(column,1)
Now I get other errors:
TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
Then I tried to get the data type of each column:
print 'dtypes : ', self.data.dtypes
The result shows that every column is one of int64, object or float64.
Then I thought of converting the object columns to numeric, like below:
self.data.convert_objects(convert_numeric=True)
I still get the same errors. Please help me solve this.
Note: none of the columns contain strings (i.e. characters), missing values or empty values. I checked this using self.data.to_csv.
I'm new to pandas and Python, so please don't mind if this is a silly question. I just want to learn.

Please review the simple example below; it reproduces the error and shows its cause.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([3,3]))
df.iloc[0,0] = np.nan
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
df.iloc[0,0] = 'string'
print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]
0 1 2
0 NaN 0.336250 0.801349
1 0.930947 0.803907 0.139484
2 0.826946 0.229269 0.367627
0 True
1 False
2 False
dtype: bool
0
0 NaN
1 0.930947
2 0.826946
0 1 2
0 string 0.336250 0.801349
1 0.930947 0.803907 0.139484
2 0.826946 0.229269 0.367627
1 False
2 False
dtype: bool
Traceback (most recent call last):
...
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
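In short, a column that contains non-numeric values (dtype object) is left out of the sum, so the boolean Series has fewer entries than the DataFrame has columns and cannot be aligned. You need to preprocess your data first. To list the object columns:
df.select_dtypes(include=['object'])
If they hold strings that can be converted to numbers, convert them with df.astype(); otherwise, remove them.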
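A minimal sketch of that preprocessing step, assuming the problem columns hold numeric strings (Python 3 syntax). It uses pd.to_numeric instead of astype() so that unparseable values become NaN rather than raising. The question's traceback suggests the real data may hold decimal.Decimal objects, which this example does not cover:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [10, 10],        # int64
    "b": ["9.5", "8.5"],  # numeric strings, not yet a numeric dtype
    "c": [0.5, 0.25],     # float64
})

# Convert every column to a numeric dtype; values that cannot be parsed become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# The mask now has exactly one entry per column, so it aligns with the columns.
mask = numeric.sum(axis=0) > 15
selected = numeric.loc[:, mask]

print(mask)
print(selected)
```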

Find sum of the column values based on some other column

I have an input file like this:
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
c,u,g,nfk,ekh,trc,085,83,xppnl
For every unique value of Column1, I need to find out the sum of column7
Similarly, for every unique value of Column2, I need to find out the sum of column7
Output for 1 should be like:
j,686
u,308
t,98
c,83
Output for 2 should be like:
z,686
i,308
a,98
u,83
I am fairly new to Python. How can I achieve the above?

This could be done using Python's Counter and csv library as follows:
from collections import Counter
import csv
c1 = Counter()
c2 = Counter()
with open('input.csv') as f_input:
    for cols in csv.reader(f_input):
        col7 = int(cols[6])
        c1[cols[0]] += col7
        c2[cols[1]] += col7
print "Column 1"
for value, count in c1.iteritems():
    print '{},{}'.format(value, count)
print "\nColumn 2"
for value, count in c2.iteritems():
    print '{},{}'.format(value, count)
Giving you the following output:
Column 1
c,85
j,686
u,308
t,1080
Column 2
i,308
a,1080
z,686
u,85
A Counter is a type of Python dictionary that is useful for counting items automatically. c1 holds all of the column 1 entries and c2 holds all of the column 2 entries. Note, Python numbers lists starting from 0, so the first entry in a list is [0].
The csv library loads each line of the file into a list, with each entry in the list representing a different column. The code takes column 7 (i.e. cols[6]) and converts it into an integer, as all columns are held as strings. It is then added to the counter using either the column 1 or 2 value as the key. The result is two dictionaries holding the totaled counts for each key.
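The same accumulation can be tried without a file on disk. This is only a sketch in Python 3 syntax, where items() replaces iteritems(); it holds a few rows from the question in memory, and the names sample and totals are illustrative:

```python
from collections import Counter
import csv
import io

# A few rows from the question, held in memory instead of input.csv.
sample = io.StringIO(
    "j,z,b,bsy,afj,upz,343,13,ruhwd\n"
    "u,i,a,dvp,ibt,dxv,154,00,adsif\n"
    "j,z,b,bsy,afj,upz,343,13,ruhwd\n"
    "c,u,g,nfk,ekh,trc,085,83,xppnl\n"
)

totals = Counter()
for cols in csv.reader(sample):
    # A missing key counts as 0, so no explicit initialisation is needed.
    totals[cols[0]] += int(cols[6])  # int("085") gives 85

for value, total in sorted(totals.items()):
    print("{},{}".format(value, total))
```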

You can use pandas:
df = pd.read_csv('my_file.csv', header=None)
print(df.groupby(0)[6].sum())
print(df.groupby(1)[6].sum())
Output:
0
c 85
j 686
t 1080
u 308
Name: 6, dtype: int64
1
a 1080
i 308
u 85
z 686
Name: 6, dtype: int64
The data frame should look like this:
print(df.head())
Output:
0 1 2 3 4 5 6 7 8
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
You can also use your own names for the columns. Like c1, c2, ... c9:
df = pd.read_csv('my_file.csv', index_col=False, names=['c' + str(x) for x in range(1, 10)])
print(df)
Output:
c1 c2 c3 c4 c5 c6 c7 c8 c9
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
5 t a a jqj dtd yxq 540 49 kxthz
6 c u g nfk ekh trc 85 83 xppnl
Now group by column 1 (c1) or column 2 (c2) and sum column 7 (c7):
print(df.groupby(['c1'])['c7'].sum())
print(df.groupby(['c2'])['c7'].sum())
Output:
c1
c 85
j 686
t 1080
u 308
Name: c7, dtype: int64
c2
a 1080
i 308
u 85
z 686
Name: c7, dtype: int64

SO isn't supposed to be a code writing service, but I had a few minutes. :) Without pandas, you can do it with the csv module:
import csv
def sum_to(results, key, add_value):
    if key not in results:
        results[key] = 0
    results[key] += int(add_value)
column1_results = {}
column2_results = {}
with open("input.csv", 'rt') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        sum_to(column1_results, row[0], row[6])
        sum_to(column2_results, row[1], row[6])
print column1_results
print column2_results
Results:
{'c': 85, 'j': 686, 'u': 308, 't': 1080}
{'i': 308, 'a': 1080, 'z': 686, 'u': 85}
Your expected results don't seem to match the math that Mike's answer and mine got using your spec. I'd double check that.

Split if two consecutive characters are not same

I have an input string like
a = '4433555555666666'
I want the values to be separated wherever a character differs from the next one.
In this case:
44, 33, 555555, 666666
I'm new to Python, so I don't know how to deal with it. My attempt gets only the first group right, i.e.
['44', '', '555555666666']
Also, if a group of two consecutive characters repeats, e.g.
a = 'chchdfch'
then each 'ch' should be replaced with '*':
a = '**df*'

You can use itertools.groupby()
[''.join(v) for k, v in itertools.groupby(a)]
Demo:
>>> import itertools
>>> a = '4433555555666666'
>>> [''.join(value) for key, value in itertools.groupby(a)]
['44', '33', '555555', '666666']
This code is a list comprehension: a compact way of building a list by iterating over the groups one at a time.
Another way of representing this is:
>>> for k, v in itertools.groupby(a):
...     print k, v
...
4 <itertools._grouper object at 0x100b90710>
3 <itertools._grouper object at 0x100b90750>
5 <itertools._grouper object at 0x100b90710>
6 <itertools._grouper object at 0x100b90750>
>>> for k, v in itertools.groupby(a):
...     print k, "".join(v)
...
4 44
3 33
5 555555
6 666666
>>>
You can simply ignore the key k that the iterator generates.
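To make the key/group pairing more concrete, here is a small sketch (Python 3 syntax) that turns each run into a (character, length) pair. The last line addresses the second part of the question, assuming a plain substring replacement is all that is wanted there:

```python
from itertools import groupby

a = "4433555555666666"

# Each group iterator can be consumed only once, so materialise it right away.
runs = [(key, len(list(group))) for key, group in groupby(a)]
print(runs)  # [('4', 2), ('3', 2), ('5', 6), ('6', 6)]

# Second part of the question, assuming a simple replacement is enough.
masked = "chchdfch".replace("ch", "*")
print(masked)  # **df*
```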

Having problems converting strings in a pandas series to lowercase

I was able to do this in a DataFrame using map(lambda x: x.lower()). I tried to use a lambda function with pd.Series.apply(), but that didn't work. Also, when I try to isolate the column with something like series['A'], should it return the index? (I guess this makes sense.) I ask because I get a float error even though the values I want to apply the lower method to are strings. Any help would be appreciated.

You can use the Series vectorised string methods, which include lower:
In [11]: df = pd.DataFrame([['A', 'B'], ['C', 4]], columns=['X', 'Y'])
In [12]: df
Out[12]:
X Y
0 A B
1 C 4
In [13]: df.X.str.lower()
Out[13]:
0 a
1 c
Name: X, dtype: object
In [14]: df.Y.str.lower()
Out[14]:
0 b
1 NaN
Name: Y, dtype: object
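The NaN in the last output comes from the non-string value 4: the .str methods return NaN for elements that are not strings. If such values should be kept as they are instead, one option is to check the type inside map. This is only a sketch in Python 3 syntax, and the names ser and lowered are illustrative:

```python
import pandas as pd

ser = pd.Series(["A", 4, "Bc"], dtype=object)

# Lower-case the strings only and leave every other value untouched.
lowered = ser.map(lambda v: v.lower() if isinstance(v, str) else v)

print(lowered.tolist())  # ['a', 4, 'bc']
```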
