I have a dataframe which contains some integer values, and I want to create a new dataframe keeping only the rows where the columns [col1, col3, col4] are not ALL zeroes. Example:
col1 col2 col3 col4 col5 col6
0 0 text1 3 0 22 0
1 9 text2 13 11 22 1
2 0 text3 0 0 22 0 # not valid
3 9 text4 13 11 22 0
4 0 text5 0 1 12 4
I am not sure if it is possible to do this with a single lambda.
There's no need for any custom function at all. We can just select the columns we want, do our boolean comparison, and then use that to index into your dataframe:
In [28]: df[["col1", "col3", "col4"]] == 0
Out[28]:
col1 col3 col4
0 True False True
1 False False False
2 True True True
3 False False False
4 True True False
In [29]: (df[["col1", "col3", "col4"]] == 0).all(axis=1)
Out[29]:
0 False
1 False
2 True
3 False
4 False
dtype: bool
In [30]: df.loc[~(df[["col1", "col3", "col4"]] == 0).all(axis=1)]
Out[30]:
col1 col2 col3 col4 col5 col6
0 0 text1 3 0 22 0
1 9 text2 13 11 22 1
3 9 text4 13 11 22 0
4 0 text5 0 1 12 4
There are lots of similar ways to rewrite it ("not all zero" is the same as "any nonzero", and so on).
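For instance, a minimal sketch of the "any nonzero" form (reconstructing the example df from the question):

import pandas as pd

df = pd.DataFrame({"col1": [0, 9, 0, 9, 0],
                   "col2": ["text1", "text2", "text3", "text4", "text5"],
                   "col3": [3, 13, 0, 13, 0],
                   "col4": [0, 11, 0, 11, 1],
                   "col5": [22, 22, 22, 22, 12],
                   "col6": [0, 1, 0, 0, 4]})

# "not all zero" == "any nonzero" (De Morgan), so this keeps the same rows
result = df[(df[["col1", "col3", "col4"]] != 0).any(axis=1)]
print(result)  # rows 0, 1, 3 and 4; row 2 is dropped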
I have a dataframe C and another dataframe S. I want to change the values in one of the columns of C if two columns in C and S have the same values.
Please consider the example given below,
C.head(3)
id1 id2 title val
0 1 0 'abc' 0
1 2 0 'bcd' 0
2 3 0 'efg' 0
S.head(3)
id1 id2
0 1 1
1 3 0
I want to assign the value 1 to the column 'val' in C, but only for the rows where C.id1 = S.id1 and C.id2 = S.id2.
The combination (id1, id2) is unique within each of the two tables.
In the above case, I want the result as
C.head(3)
id1 id2 title val
0 1 0 'abc' 0
1 2 0 'bcd' 0
2 3 0 'efg' 1
since only the third row of C matches one of the rows of S on the columns id1 and id2.
I think you need merge with a left join and the indicator parameter, then convert the boolean mask to 0 and 1:
# if the columns to join on are the only shared columns, the on parameter can be omitted
df = C.merge(S, indicator=True, how='left')
# if the DataFrames share more same-named columns than the join keys, specify on explicitly:
#df = C.merge(S, indicator=True, how='left', on=['id1', 'id2'])
df['val'] = (df['_merge'] == 'both').astype(int)
df = df.drop('_merge', axis=1)
print (df)
id1 id2 val
0 1 0 0
1 2 0 0
2 3 0 1
The solution works nicely with the new data:
df = C.merge(S, indicator=True, how='left')
# if the DataFrames share more same-named columns than the join keys, specify on explicitly:
#df = C.merge(S, indicator=True, how='left', on=['id1', 'id2'])
df['val'] = (df['_merge'] == 'both').astype(int)
df = df.drop('_merge', axis=1)
print (df)
id1 id2 title val
0 1 0 abc 0
1 2 0 bcd 0
2 3 0 efg 1
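A sketch of a variant under the same assumptions (unique (id1, id2) pairs in both tables, default RangeIndex on C) that writes the flag straight into C instead of keeping the merged frame:

import pandas as pd

C = pd.DataFrame({'id1': [1, 2, 3], 'id2': [0, 0, 0],
                  'title': ['abc', 'bcd', 'efg'], 'val': [0, 0, 0]})
S = pd.DataFrame({'id1': [1, 3], 'id2': [1, 0]})

# a left merge keeps C's row order; _merge == 'both' marks the matched rows
m = C.merge(S, on=['id1', 'id2'], how='left', indicator=True)
C['val'] = m['_merge'].eq('both').astype(int).to_numpy()
print(C)  # val becomes [0, 0, 1]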
Using Power Query "M" language, how would you transform a categorical column containing discrete values into multiple "dummy" columns? I come from the Python world, where there are several ways to do this; one way is shown below:
>>> import pandas as pd
>>> dataset = pd.DataFrame(list('ABCDACDEAABADDA'),
                           columns=['my_col'])
>>> dataset
my_col
0 A
1 B
2 C
3 D
4 A
5 C
6 D
7 E
8 A
9 A
10 B
11 A
12 D
13 D
14 A
>>> pd.get_dummies(dataset)
my_col_A my_col_B my_col_C my_col_D my_col_E
0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 1 0 0
6 0 0 0 1 0
7 0 0 0 0 1
8 1 0 0 0 0
9 1 0 0 0 0
10 0 1 0 0 0
11 1 0 0 0 0
12 0 0 0 1 0
13 0 0 0 1 0
14 1 0 0 0 0
Interesting question. Here's an easy, scalable method I've found:
Create a custom column of all ones (Add Column > Custom Column > Formula = 1).
Add an index column (Add Column > Index Column).
Pivot on the my_col column, using Custom as the values column (select my_col > Transform > Pivot Column).
Replace null values with 0 (select all columns > Transform > Replace Values).
Here's what the M code looks like for this process:
#"Added Custom" = Table.AddColumn(#"Previous Step", "Custom", each 1),
#"Added Index" = Table.AddIndexColumn(#"Added Custom", "Index", 0, 1),
#"Pivoted Column" = Table.Pivot(#"Added Index", List.Distinct(#"Added Index"[my_col]), "my_col", "Custom"),
#"Replaced Value" = Table.ReplaceValue(#"Pivoted Column",null,0,Replacer.ReplaceValue,Table.ColumnNames(#"Pivoted Column"))
Once you've completed the above, you can remove the index column if desired.
I want to convert a DataFrame of user_Id and skills into a zero/one matrix of users and their corresponding skills.
Input DataFrame
user_Id skills
0 user1 [java, hdfs, hadoop]
1 user2 [python, c++, c]
2 user3 [hadoop, java, hdfs]
3 user4 [html, java, php]
4 user5 [hadoop, php, hdfs]
Desired Output DataFrame
user_Id java c c++ hadoop hdfs python html php
user1 1 0 0 1 1 0 0 0
user2 0 1 1 0 0 1 0 0
user3 1 0 0 1 1 0 0 0
user4 1 0 0 0 0 0 1 1
user5 0 0 0 1 1 0 0 1
You can join a new DataFrame back to user_Id: use astype(str) if you need to convert the lists to strings (otherwise omit it), then remove the [] with strip and use get_dummies:
df = df[['user_Id']].join(df['skills'].astype(str).str.strip('[]').str.get_dummies(', '))
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
Alternatively, build the dummies separately, clean up the column names, and concat:
df1 = df['skills'].astype(str).str.strip('[]').str.get_dummies(', ')
# if necessary, remove ' from the column names
df1.columns = df1.columns.str.strip("'")
df = pd.concat([df['user_Id'], df1], axis=1)
print (df)
user_Id c c++ hadoop hdfs html java php python
0 user1 0 0 1 1 0 1 0 0
1 user2 1 1 0 0 0 0 0 1
2 user3 0 0 1 1 0 1 0 0
3 user4 0 0 0 0 1 1 1 0
4 user5 0 0 1 1 0 0 1 0
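If the skills column holds actual Python lists rather than their string representation, a sketch using explode (available since pandas 0.25) and crosstab avoids the string parsing entirely:

import pandas as pd

df = pd.DataFrame({'user_Id': ['user1', 'user2', 'user3', 'user4', 'user5'],
                   'skills': [['java', 'hdfs', 'hadoop'], ['python', 'c++', 'c'],
                              ['hadoop', 'java', 'hdfs'], ['html', 'java', 'php'],
                              ['hadoop', 'php', 'hdfs']]})

# one row per (user, skill) pair, then cross-tabulate users against skills
s = df.set_index('user_Id')['skills'].explode()
dummies = pd.crosstab(s.index, s)
dummies.columns.name = None
print(dummies.reset_index())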
I'm pretty new to pandas and would like your input on how to tackle my problem. I've got the following data frame:
df = pd.DataFrame({'A' : ["me","you","you","me","me","me","me"],
                   'B' : ["Y","X","X","X","X","X","Z"],
                   'C' : ["1","2","3","4","5","6","7"]})
I need to transform it based on the row values in columns A and B. The logic is: whenever the values in columns A and B are the same on consecutive rows, the first row of the run is kept, but the following rows get an 'A' set in column B.
For example: the values in columns A and B are the same in rows 1 and 2, so the value in column B of row 2 should be replaced with A. This is my expected output:
df2 = pd.DataFrame({'A' : ["me","you","you","me","me","me","me"],
                    'B' : ["Y","X","A","X","A","A","Z"],
                    'C' : ["1","2","3","4","5","6","7"]})
You can first concatenate columns A and B (the values are strings, so + joins them):
a = df.A + df.B
Then compare with the shifted version:
print (a != a.shift())
0 True
1 True
2 False
3 True
4 False
5 False
6 True
dtype: bool
Create unique groups by cumsum:
print ((a != a.shift()).cumsum())
0 1
1 2
2 2
3 3
4 3
5 3
6 4
dtype: int32
Get a boolean mask where values are duplicated:
print ((a != a.shift()).cumsum().duplicated())
0 False
1 False
2 True
3 False
4 True
5 True
6 False
dtype: bool
Two solutions for replacing the masked values with 'A':
df.loc[(a != a.shift()).cumsum().duplicated(), 'B'] = 'A'
print (df)
A B C
0 me Y 1
1 you X 2
2 you A 3
3 me X 4
4 me A 5
5 me A 6
6 me Z 7
df.B = df.B.mask((a != a.shift()).cumsum().duplicated(), 'A')
print (df)
A B C
0 me Y 1
1 you X 2
2 you A 3
3 me X 4
4 me A 5
5 me A 6
6 me Z 7
print (df2.equals(df))
True
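A sketch of a variant of the same grouping idea that compares the two columns against their shifted versions directly; this sidesteps the rare collisions that string concatenation can produce (e.g. "ab" + "c" equals "a" + "bc"):

import pandas as pd

df = pd.DataFrame({'A': ["me", "you", "you", "me", "me", "me", "me"],
                   'B': ["Y", "X", "X", "X", "X", "X", "Z"],
                   'C': ["1", "2", "3", "4", "5", "6", "7"]})

# True where a row starts a new (A, B) run; cumsum labels the runs
new_run = df[['A', 'B']].ne(df[['A', 'B']].shift()).any(axis=1)
df.loc[new_run.cumsum().duplicated(), 'B'] = 'A'
print(df)  # same result as above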
>>> df.head()
# Summer Gold Silver Bronze Total # Winter \
Afghanistan (AFG) 13 0 0 2 2 0
Algeria (ALG) 12 5 2 8 15 3
Argentina (ARG) 23 18 24 28 70 18
Armenia (ARM) 5 1 2 9 12 6
Australasia (ANZ) [ANZ] 2 3 4 5 12 0
Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 \
Afghanistan (AFG) 0 0 0 0 13 0
Algeria (ALG) 0 0 0 0 15 5
Argentina (ARG) 0 0 0 0 41 18
Armenia (ARM) 0 0 0 0 11 1
Australasia (ANZ) [ANZ] 0 0 0 0 2 3
Silver.2 Bronze.2 Combined total
Afghanistan (AFG) 0 2 2
Algeria (ALG) 2 8 15
Argentina (ARG) 24 28 70
Armenia (ARM) 2 9 12
Australasia (ANZ) [ANZ] 4 5 12
Not sure why I see this error:
>>> df['Gold'] > 0 | df['Gold.1'] > 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ankuragarwal/data_insight/env/lib/python2.7/site-packages/pandas/core/generic.py", line 917, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
What's ambiguous here?
But this works:
>>> (df['Gold'] > 0) | (df['Gold.1'] > 0)
Assuming we have the following DF:
In [35]: df
Out[35]:
a b c
0 9 0 1
1 7 7 4
2 1 8 9
3 6 7 5
4 1 4 6
The following command:
df.a > 5 | df.b > 5
will not work as expected, because | has higher precedence than > (as specified in the Python operator precedence table); it is translated to:
df.a > (5 | df.b) > 5
which will be translated to:
df.a > (5 | df.b) and (5 | df.b) > 5
step by step:
In [36]: x = (5 | df.b)
In [37]: x
Out[37]:
0 5
1 7
2 13
3 7
4 5
Name: b, dtype: int32
In [38]: df.a > x
Out[38]:
0 True
1 False
2 False
3 False
4 False
dtype: bool
In [39]: x > 5
Out[39]:
0 False
1 True
2 True
3 True
4 False
Name: b, dtype: bool
but the last operation won't work:
In [40]: (df.a > x) and (x > 5)
---------------------------------------------------------------------------
...
skipped
...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The error message above might lead inexperienced users to do something like this:
In [12]: (df.a > 5).all() | (df.b > 5).all()
Out[12]: False
In [13]: df[(df.a > 5).all() | (df.b > 5).all()]
...
skipped
...
KeyError: False
But in this case you just need to set the precedence explicitly in order to get the expected result:
In [10]: (df.a > 5) | (df.b > 5)
Out[10]:
0 True
1 True
2 True
3 True
4 False
dtype: bool
In [11]: df[(df.a > 5) | (df.b > 5)]
Out[11]:
a b c
0 9 0 1
1 7 7 4
2 1 8 9
3 6 7 5
This is the real reason for the error:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if statement or when using the boolean operations and, or, and not. It is not clear what the result of
>>> if pd.Series([False, True, False]):
...
should be. Should it be True because it's not zero-length? False because there are False values? It is unclear, so instead, pandas raises a ValueError:
>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
If you see that, you need to explicitly choose what you want to do with it (e.g., use any(), all() or empty). Or you might want to check whether the pandas object is None.
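For example, a minimal sketch of the explicit reductions the message suggests:

import pandas as pd

s = pd.Series([False, True, False])

if s.any():        # True: at least one element is True
    print("some element is True")
if not s.all():    # True: not every element is True
    print("not all elements are True")
if not s.empty:    # True: the Series has elements
    print("the Series is non-empty")
if s is not None:  # identity check, not element-wise truth
    print("the Series is not None")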