Strange behaviour when adding columns - python-2.7

I'm using Python 2.7.8 | Anaconda 2.1.0, and I'm wondering why the strange behavior below occurs.
I create a pandas DataFrame with two columns, then add a third column by summing the first two:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randn(5, 2), columns=['a', 'b'])
x['c'] = x[['a', 'b']].sum(axis=1)  # or x['c'] = x['a'] + x['b']
Out[7]:
          a         b         c
0 -1.644246  0.851602 -0.792644
1 -0.129092  0.237140  0.108049
2  0.623160  0.105494  0.728654
3  0.737803 -1.612189 -0.874386
4  0.340671 -0.113334  0.227337
All good so far. Now I want to set the values of column 'c' to zero where they are negative:
x[x['c']<0] = 0
Out[9]:
          a         b         c
0  0.000000  0.000000  0.000000
1 -0.129092  0.237140  0.108049
2  0.623160  0.105494  0.728654
3  0.000000  0.000000  0.000000
4  0.340671 -0.113334  0.227337
This gives the desired result in column 'c', but for some reason columns 'a' and 'b' have been modified as well, which I don't want. Why is this happening, and how can I fix it?

You have to specify you only want the 'c' column:
x.loc[x['c']<0, 'c'] = 0
When you index with just a boolean array/series, this selects full rows, as you can see in this example:
In [46]: x['c'] < 0
Out[46]:
0     True
1    False
2    False
3     True
4    False
Name: c, dtype: bool
In [47]: x[x['c']<0]
Out[47]:
          a         b         c
0 -0.444493 -0.592318 -1.036811
3 -1.363727 -1.572558 -2.936285

Because you are setting zero for all the columns; you should set it only for column 'c':
x['c'][x['c']<0] = 0
(Chained indexing like this can trigger a SettingWithCopyWarning; x.loc[x['c']<0, 'c'] = 0 is the safer form.)
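The difference between the two assignments can be illustrated with fixed values instead of random ones; a minimal sketch showing that a bare boolean index assigns to whole rows, while .loc restricts the assignment to column 'c':

```python
import pandas as pd

# fixed values instead of np.random.randn, so the effect is reproducible
x = pd.DataFrame({'a': [1.0, -2.0], 'b': [-3.0, 1.0]})
x['c'] = x['a'] + x['b']          # c == [-2.0, -1.0], both negative

y = x.copy()
y[y['c'] < 0] = 0                 # boolean index alone: zeroes the WHOLE row

z = x.copy()
z.loc[z['c'] < 0, 'c'] = 0        # .loc with a column label: zeroes only 'c'
```

After this, y is all zeros, while z keeps 'a' and 'b' intact and only 'c' is zeroed.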

Related

Keeping some data from duplicates and adding to existing python dataframe

I have an issue with keeping some data from duplicates: I want to spread the duplicated values into new columns of the dataframe.
import pandas as pd
data = {'id':[1,1,2,2,3],'key':[1,1,2,2,1],'value0':['a', 'b', 'x', 'y', 'a']}
frame = pd.DataFrame(data, columns = ['id','key','value0'])
print frame
Yields:
   id  key value0
0   1    1      a
1   1    1      b
2   2    2      x
3   2    2      y
4   3    1      a
Desired Output:
   key value0_0 value0_1 value1_0
0    1        a        b        a
1    2        x        y     None
The "id" column isn't important to keep but could help with iteration and grouping.
I think this could be adapted to other projects where you don't know how many values exist for a set of keys.
Use set_index, including a cumcount, then unstack:
frame.set_index(
    ['key', frame.groupby('key').cumcount()]
).value0.unstack().add_prefix('value0_').reset_index()
   key value0_0 value0_1 value0_2
0    1        a        b        a
1    2        x        y     None
I'm questioning your column labeling, but here is an approach using binary column names:
frame.set_index(
    ['key', frame.groupby('key').cumcount()]
).value0.unstack().rename(
    columns='{:02b}'.format
).add_prefix('value_').reset_index()
   key value_00 value_01 value_10
0    1        a        b        a
1    2        x        y     None
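To see what drives the answer above, here is a minimal sketch (same data as the question) that exposes the intermediate cumcount values before the unstack:

```python
import pandas as pd

frame = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                      'key': [1, 1, 2, 2, 1],
                      'value0': ['a', 'b', 'x', 'y', 'a']})

# cumcount numbers the repeats within each key group: 0, 1, 0, 1, 2
counter = frame.groupby('key').cumcount()

# pair (key, occurrence number) as the index, then pivot occurrences to columns
wide = (frame.set_index(['key', counter])['value0']
             .unstack()
             .add_prefix('value0_')
             .reset_index())
```

Keys that occur fewer times than the maximum get NaN in the extra columns, which is where the None in the output comes from.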

scikit-learn: One hot encoding of column with list values [duplicate]

This question already has answers here:
How to one hot encode variant length features?
(2 answers)
Closed 5 years ago.
I am trying to encode a dataframe like the one below:
A  B        C
2  'Hello'  ['we', 'are', 'good']
1  'All'    ['hello', 'world']
I can label-encode the string values of the second column, but I cannot figure out how to encode the third column, which holds lists of strings of varying length. Even if I one-hot encode it, I get an array that I don't know how to merge with the encoded elements of the other columns. Please suggest a good technique.
Assuming we have the following DF:
In [31]: df
Out[31]:
   A      B                C
0  2  Hello  [we, are, good]
1  1    All   [hello, world]
Let's use sklearn.feature_extraction.text.CountVectorizer
In [32]: from sklearn.feature_extraction.text import CountVectorizer
In [33]: vect = CountVectorizer()
In [34]: X = vect.fit_transform(df.C.str.join(' '))
In [35]: df = df.join(pd.DataFrame(X.toarray(), columns=vect.get_feature_names()))
In [36]: df
Out[36]:
   A      B                C  are  good  hello  we  world
0  2  Hello  [we, are, good]    1     1      0   1      0
1  1    All   [hello, world]    0     0      1   0      1
Alternatively, you can use sklearn.preprocessing.MultiLabelBinarizer, as @VivekKumar suggested in a comment:
In [56]: from sklearn.preprocessing import MultiLabelBinarizer
In [57]: mlb = MultiLabelBinarizer()
In [58]: X = mlb.fit_transform(df.C)
In [59]: df = df.join(pd.DataFrame(X, columns=mlb.classes_))
In [60]: df
Out[60]:
   A      B                C  are  good  hello  we  world
0  2  Hello  [we, are, good]    1     1      0   1      0
1  1    All   [hello, world]    0     0      1   0      1
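If sklearn is unavailable, a plain-pandas alternative (a sketch, not part of the answers above) produces the same 0/1 indicator table via str.join and str.get_dummies:

```python
import pandas as pd

df = pd.DataFrame({'C': [['we', 'are', 'good'], ['hello', 'world']]})

# join each list into one separator-delimited string,
# then expand it into indicator columns (one per unique token)
dummies = df['C'].str.join('|').str.get_dummies(sep='|')
```

The indicator columns come out in sorted token order, matching the CountVectorizer/MultiLabelBinarizer results above; df.join(dummies) merges them back.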

How should I check more than 10 columns for NaN values and select the rows having NaN values, i.e. keepna() instead of dropna()?

Output = df[df['TELF1'].isnull() | df['STCEG'].isnull() | df['STCE1'].isnull()]
This is my code: I select a row only if one of these columns contains a NaN value. But I have over 10 columns to check, which would make the code huge. Is there a shorter or more pythonic way to do it?
df.dropna(subset=['STRAS','ORT01','LAND1','PSTLZ','STCD1','STCD2','STCEG','TELF1','BANKS','BANKL','BANKN','E-MailAddress'])
Is there any way to get the opposite of the above command? It would give me the same output I was trying to get above, without being so long.
Using loc with a simple boolean filter should work:
df = pd.DataFrame(np.random.random((5,4)), columns=list('ABCD'))
subset = ['C', 'D']
df.at[0, 'C'] = None
df.at[4, 'D'] = None
>>> df
          A         B         C         D
0  0.985707  0.806581       NaN  0.373860
1  0.232316  0.321614  0.606824  0.439349
2  0.956236  0.169002  0.989045  0.118812
3  0.329509  0.644687  0.034827  0.637731
4  0.980271  0.001098  0.918052       NaN
>>> df.loc[df[subset].isnull().any(axis=1), :]
          A         B         C        D
0  0.985707  0.806581       NaN  0.37386
4  0.980271  0.001098  0.918052      NaN
df[subset].isnull() returns, for each element of the subset columns, whether it is NaN.
>>> df[subset].isnull()
       C      D
0   True  False
1  False  False
2  False  False
3  False  False
4  False   True
.any(axis=1) will return True if any value in the row (because axis=1, otherwise the column) is True.
>>> df[subset].isnull().any(axis=1)
0     True
1    False
2    False
3    False
4     True
dtype: bool
Finally, use loc (rows, columns) to locate rows that satisfy a boolean condition. The : symbol means to select everything, so it selects all columns for rows 0 and 4.
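Another way to phrase the "opposite of dropna" idea, sketched here with a small hypothetical frame: let dropna compute the rows it would keep, then take the index difference:

```python
import numpy as np
import pandas as pd

# hypothetical frame: column B has NaN in rows 0 and 2
df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [np.nan, 5.0, np.nan]})
subset = ['B']

kept = df.dropna(subset=subset)                      # rows dropna would keep
opposite = df.loc[df.index.difference(kept.index)]   # everything dropna would drop
```

This gives exactly the rows dropna would discard, with the long subset list written only once.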

Convert data frame column to float and perform operation in Pandas

I have a data frame containing the following values, which are imported as strings:
df3 = pd.DataFrame(data = {
    'Column1':['10/1','9/5','7/4','12/3','18/7','14/2']})
I tried to convert to float and do the division. The following did not work well:
for i, v in enumerate(df3.Column1):
    df3['Column2'] = float(v[:-2]) / float(v[-1])
print df3.Column2
This is the output that I am trying to achieve
df3 = pd.DataFrame(data = {
    'Column1':['10/1','9/5','7/4','12/3','18/7','14/2'],
    'Column2':['10.0','1.8','1.75','4.0','2.57142857143','7.0']})
df3
The following would work: define a function that performs the casting to float and returns the result of the division, then assign the result of applying it to your new column:
In [10]:
df3 = pd.DataFrame(data = {
    'Column1':['10/1','9/5','7/4','12/3','18/7','14/2']})
def func(x):
    return float(x[:-2]) / float(x[-1])
df3['Column2'] = df3['Column1'].apply(func)
df3
Out[10]:
Column1 Column2
0 10/1 10.000000
1 9/5 1.800000
2 7/4 1.750000
3 12/3 4.000000
4 18/7 2.571429
5 14/2 7.000000
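A vectorized alternative that avoids both apply and eval (a sketch using str.split, not part of the original answers): split each string on '/' and divide the two resulting float columns:

```python
import pandas as pd

df3 = pd.DataFrame({'Column1': ['10/1', '9/5', '7/4']})

# split on '/' into two columns, cast both to float, then divide elementwise
parts = df3['Column1'].str.split('/', expand=True).astype(float)
df3['Column2'] = parts[0] / parts[1]
```

Unlike the v[:-2] slicing, this also handles multi-digit denominators such as '7/12'.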
if, and ONLY IF, you do not have input/data from an untrusted source, here's a shortcut:
In [46]: df3
Out[46]:
Column1
0 10/1
1 9/5
2 7/4
3 12/3
4 18/7
5 14/2
In [47]: df3.Column1.map(eval)
Out[47]:
0 10.000000
1 1.800000
2 1.750000
3 4.000000
4 2.571429
5 7.000000
Name: Column1, dtype: float64
But seriously... be careful with eval. Note also that under Python 2, eval('9/5') does integer division and returns 1; the float results above assume Python 3 division semantics (or from __future__ import division).

Convert 2D numpy.ndarray to pandas.DataFrame

I have a pretty big numpy.ndarray. It's basically an array of arrays. I want to convert it to a pandas.DataFrame. What I want to do is in the code below:
from pandas import DataFrame
cache1 = DataFrame([{'id1': 'ABC1234'}, {'id1': 'NCMN7838'}])
cache2 = DataFrame([{'id2': 3276827}, {'id2': 98567498}, {'id2': 38472837}])
ndarr = [[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]]
arr = []
for idx, i in enumerate(ndarr):
    id1 = cache1.ix[idx].id1
    for idx2, val in enumerate(i):
        id2 = cache2.ix[idx2].id2
        if val > 0:
            arr.append(dict(id1=id1, id2=id2, value=val))
df = DataFrame(arr)
print(df.head())
I am mapping the index of the outer array and the inner array to index of two DataFrames to get certain IDs.
cache1 and cache2 are pandas.DataFrame. Each has ~100k rows.
This takes really really long, like a few hours to complete.
Is there some way I can speed it up?
I suspect your ndarr, if expressed as a 2D np.array, always has shape (n, m), where n is the length of cache1.id1 and m is the length of cache2.id2. And the last entry in cache2 should be {'id2': 38472837} instead of {'id': 38472837}. If so, the following simple solution may be all that is needed:
In [30]:
df = pd.DataFrame(np.array(ndarr).ravel(),
                  index=pd.MultiIndex.from_product([cache1.id1.values, cache2.id2.values],
                                                   names=['idx1', 'idx2']),
                  columns=['val'])
In [33]:
print df.reset_index()
       idx1      idx2  val
0   ABC1234   3276827  4.3
1   ABC1234  98567498  5.6
2   ABC1234  38472837  6.7
3  NCMN7838   3276827  3.2
4  NCMN7838  98567498  4.5
5  NCMN7838  38472837  2.1
[6 rows x 3 columns]
Actually, I also think that keeping the MultiIndex may be a better idea.
Something like this should work:
ndarr = np.asarray(ndarr) # if ndarr is actually an array, skip this
fast_df = pd.DataFrame({"value": ndarr.ravel()})
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df["id1"] = cache1["id1"].loc[i1].values
fast_df["id2"] = cache2["id2"].loc[i2].values
which gives
>>> fast_df
   value       id1       id2
0    4.3   ABC1234   3276827
1    5.6   ABC1234  98567498
2    6.7   ABC1234       NaN
3    3.2  NCMN7838   3276827
4    4.5  NCMN7838  98567498
5    2.1  NCMN7838       NaN
And then if you really want to drop the zero values, you can keep only the nonzero ones using fast_df = fast_df[fast_df['value'] != 0].
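As a quick sanity check of the np.indices mapping used above, here is a self-contained sketch with the question's data, where cache1 and cache2 are replaced by plain lists for brevity (an assumption for the example):

```python
import numpy as np
import pandas as pd

ndarr = np.array([[4.3, 5.6, 6.7], [3.2, 4.5, 2.1]])
id1 = ['ABC1234', 'NCMN7838']
id2 = [3276827, 98567498, 38472837]

# np.indices gives, for every element of the raveled array,
# its original row index (i1) and column index (i2)
i1, i2 = [i.ravel() for i in np.indices(ndarr.shape)]
fast_df = pd.DataFrame({'value': ndarr.ravel(),
                        'id1': np.array(id1)[i1],
                        'id2': np.array(id2)[i2]})
```

Each raveled value is paired with the id1 of its row and the id2 of its column, which is exactly what the nested-loop version computes, without Python-level iteration.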