How to compare values in pandas between two different columns? - python-2.7

My Table:
A Country Code1 Code2
626349 US 640AD1237 407223
702747 NaN IO1062123 407255
824316 US NaN NaN
712947 US 00220221 870262123
278147 Canada 721AC31234 109123
278144 Canada NaN 7214234321
278142 Canada 72142QW134 109123AS12
In the table above I need to check the country against the codes.
I want a fifth column that marks each row as correct or incorrect. Pseudocode:
If 'Country' == 'US' and (length(Code1) == 9 OR length(Code2) == 9):
    Add values to 5th column as correct.
else:
    Add values to 5th column as incorrect.
If 'Country' == 'Canada' and (length(Code1) == 10 OR length(Code2) == 10):
    Add values to 5th column as correct.
else:
    Add values to 5th column as incorrect.
If there is no value in either the Country or the Code columns, then the row should be marked as insufficient information.
I can't work out how to do this in pandas. Please help. Thanks.
I first tried to find the length of each value in Code1 and Code2 and store the lengths in a separate DataFrame, but after that I could not work out how to compare the two sets of data:
Len1 = df.Code1.map(len)
Len2 = df.Code2.map(len)
LengthCode = pd.DataFrame({'Len_Code1': Len1,'Len_Code2': Len2})
Is there a better way to do this in a single DataFrame, if possible?
I also tried this:
df[(df.Country == 'US') & ((df.Code1.str.len() == 9)|(df.Code2.str.len() == 9))|(df.Country == 'Canada') & ((df.Code1.str.len() == 10)|(df.Code2.str.len() == 10))]
But this is getting long, and I will not be able to write it out for many countries.

This will give you an 'is_correct' boolean column:
code_lengths = {'US': 9, 'Canada': 10}
# map() looks up the required length for each country (NaN if the country is not in the dict)
df['correct_code_length'] = df.Country.map(code_lengths)
df['is_correct'] = (df.Code1.apply(lambda x: len(str(x))) == df.correct_code_length) | (df.Code2.apply(lambda x: len(str(x))) == df.correct_code_length)
You will need to add more countries to the code_lengths dictionary as necessary.
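The question also asks for an "insufficient information" label, which a boolean column cannot express. A minimal sketch of one way to get three labels is below. It assumes that "insufficient" means the country is missing (or not in the dictionary) or both codes are missing, which is only one reading of the question. The column name status is hypothetical, and the codes are assumed to be stored as strings.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; missing values are NaN.
df = pd.DataFrame({
    "Country": ["US", np.nan, "US", "US", "Canada", "Canada", "Canada"],
    "Code1": ["640AD1237", "IO1062123", np.nan, "00220221",
              "721AC31234", np.nan, "72142QW134"],
    "Code2": ["407223", "407255", np.nan, "870262123",
              "109123", "7214234321", "109123AS12"],
})

code_lengths = {"US": 9, "Canada": 10}

# Required length per row; NaN when the country is missing or unknown.
expected = df["Country"].map(code_lengths)

# Length of each code; NaN where the code itself is missing.
len1 = df["Code1"].str.len().astype(float)
len2 = df["Code2"].str.len().astype(float)

has_match = (len1 == expected) | (len2 == expected)

# Assumption: no usable country, or no code at all, counts as insufficient.
insufficient = expected.isna() | (df["Code1"].isna() & df["Code2"].isna())

# np.select picks the first condition that is True for each row.
df["status"] = np.select(
    [insufficient, has_match],
    ["insufficient information", "correct"],
    default="incorrect",
)
print(df)
```

Adding a country then only requires a new entry in code_lengths, rather than another clause in a long boolean expression.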

Related

Filter data using IF Statement in Tableau

I have a data source in tableau that looks something similar to this:
SKU Backup_Storage
A 5
A 1
B 2
B 3
C 1
D 0
I'd like to create a calculated field in Tableau that performs a SUM calculation if the SKU column contains the string 'A' or 'D', and an AVERAGE calculation if the SKU column contains the letters 'C' or 'B'.
This is what I am doing:
IF CONTAINS(ATTR([SKU]),'A') or
CONTAINS(ATTR([SKU]),'D')
THEN SUM([Backup_Storage])
ELSEIF CONTAINS(ATTR([SKU]),'B') or
CONTAINS(ATTR([SKU]),'C')
THEN AVG([Backup_Storage])
END
UPDATE - desired output would be:
SKU BACKUP
A, D 6 (This is the SUM OF A and D)
B, C 2 (This is the AVG of B and C)
The calculation above shows as valid, however, I see NULLS in my data source table.
Any suggestion is appreciated.
I have named the calculated field:
SKU_FILTER_CALCULATION
Basically, an IF THEN ELSE condition works when there is one test that is either TRUE or FALSE. Your condition is not a proper use case for IF THEN ELSE, because SKU can take any of its possible values from row to row. Look at it like this.
Your data:
SKU Backup_Storage
A 5
A 1
B 2
B 3
C 1
D 0
Let's call your calculated field CF. In the first row SKU is A, so CF outputs SUM(5) = 5. For the second row it outputs SUM(1) = 1, and for the remaining rows it outputs AVG(2) = 2, AVG(3) = 3, AVG(1) = 1 and SUM(0) = 0 respectively. All of these values simply equal [Backup_Storage], and I'm sure that is not what you are trying to get.
If instead you are trying to get SUM(5, 1, 0) + AVG(2, 3, 1) (I have assumed + here), which equals 6 + 2 = 8, i.e. one single value for the whole dataset, use this calculated field:
SUM(IF CONTAINS([SKU], 'A') OR CONTAINS([SKU], 'D')
THEN [Backup_Storage] END)
+
AVG(IF CONTAINS([SKU], 'B') OR CONTAINS([SKU], 'C')
THEN [Backup_Storage] END)
This will return 8 when placed in the view.
Needless to say, if you want an operator other than +, you have to change it in CF accordingly.
Following your edited post, I suggest a different approach: create groups for the SKUs that need different aggregations.
Step-1 Create a group on the SKU field. I have named this group SKUG.
Step-2 Create a calculated field CF as follows. ZN converts NULL to 0, so the branch that does not apply to a group contributes 0 instead of turning the whole sum into NULL:
SUM(ZN(IF CONTAINS([SKU], 'A') OR CONTAINS([SKU], 'D')
THEN [Backup_Storage] END))
+
AVG(ZN(IF CONTAINS([SKU], 'B') OR CONTAINS([SKU], 'C')
THEN [Backup_Storage] END))
Step-3 get your desired view
Good Luck
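As a cross-check of the arithmetic only (this is pandas, not Tableau), the desired output can be reproduced from the sample rows. The group labels are a hypothetical stand-in for the SKUG group in Step-1.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "SKU": ["A", "A", "B", "B", "C", "D"],
    "Backup_Storage": [5, 1, 2, 3, 1, 0],
})

# Hypothetical grouping that mirrors the SKUG group.
sku_group = df["SKU"].map({"A": "A, D", "D": "A, D", "B": "B, C", "C": "B, C"})

# SUM for the A/D group, AVG for the B/C group.
sum_ad = df.loc[sku_group == "A, D", "Backup_Storage"].sum()
avg_bc = df.loc[sku_group == "B, C", "Backup_Storage"].mean()

print(sum_ad, avg_bc)
```

The two values match the 6 and 2 in the desired output, and their sum is the 8 returned by the single-value calculated field.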

select column with non-zero values from dataframe

I have data like the data below. I would like to only return the columns from the dataframe that contain at least one non-zero value. So in the example below it would be column ALF. Returning non-zero rows doesn’t seem that tricky but selecting the column and records is giving me a little trouble.
print df
Data:
Type ADR ALE ALF AME
Seg0 0.0 0.0 0.0 0.0
Seg1 0.0 0.0 0.5 0.0
When I try something like the link below:
Pandas: How to select columns with non-zero value in a sparse table
m1 = (df['Type'] == 'Seg0')
m2 = (df[m1] != 0).all()
print (df.loc[m1,m2])
I get a key error for 'Type'
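In my opinion, you get the KeyError because the first column (Type) is the index, not a regular column.
The solution is to use DataFrame.any to build a mask that is True for columns with at least one non-zero value, and then filter the mask's index by the True entries: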
m2 = (df != 0).any()
a = m2.index[m2]
print (a)
Index(['ALF'], dtype='object')
Or, if you need a list:
a = m2.index[m2].tolist()
print (a)
['ALF']
A similar solution is to filter the column names:
a = df.columns[m2]
Detail:
print (m2)
ADR False
ALE False
ALF True
AME False
dtype: bool
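To return the data rather than just the column names, the same mask can be passed to loc. A small sketch, assuming Type is the index as suggested above; the variable names nonzero_cols and rows_and_cols are illustrative:

```python
import pandas as pd

# Sample data from the question, with Type as the index.
df = pd.DataFrame(
    {"ADR": [0.0, 0.0], "ALE": [0.0, 0.0], "ALF": [0.0, 0.5], "AME": [0.0, 0.0]},
    index=pd.Index(["Seg0", "Seg1"], name="Type"),
)

# True for each column that has at least one non-zero value.
m2 = (df != 0).any()

# Keep only those columns (all rows).
nonzero_cols = df.loc[:, m2]

# Keep only rows and columns that contain a non-zero value.
rows_and_cols = df.loc[(df != 0).any(axis=1), m2]

print(nonzero_cols)
print(rows_and_cols)
```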

Python If Else Segmentation 'Invalid Type Comparison' Error in Dataframe

I have a sample dataframe (df_merged_1) shown below:
The Revenue column is a float64 dtype. I want to create a new column called 'Revenue_Segment'. This is what I want the end result to look like:
Below is the code I used to segment:
if df_merged_1['Revenue'] >= 0 and df_merged_1['Revenue'] <= 2200:
    df_merged_1['AUM_Segment'] == 'test1'
else:
    df_merged_1['AUM_Segment'] == 'test0'
But the code is not working ... I get the following error:
TypeError: invalid type comparison
Any help is greatly appreciated!
Not elegant, but this is the only solution I can think of now:
# Make a new list to add it later to your df
rev_segment = []
# Loop through the Revenue column
for revenue in df_merged_1['Revenue']:
    if revenue >= 0 and revenue <= 2200:
        rev_segment.append('test1')
    else:
        rev_segment.append('test0')
# Now assign the list to a new column
df_merged_1['Revenue_Segment'] = rev_segment
Try this:
import numpy as np
df_merged_1['AUM_Segment'] = \
    np.where((df_merged_1['Revenue']>=0) & (df_merged_1['Revenue']<=2200), 'test1', 'test0')
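The original if statement cannot work on a whole column: a comparison on a Series yields one boolean per row, which if cannot reduce to a single True or False, and == inside the branches compares rather than assigns. A vectorised sketch of the same segmentation using Series.between follows. The Revenue values are made up for illustration, since the question's sample data is not shown.

```python
import numpy as np
import pandas as pd

# Made-up revenue values; the question's real data is not available.
df_merged_1 = pd.DataFrame({"Revenue": [-50.0, 0.0, 1500.0, 2200.0, 9000.0]})

# between() includes both endpoints by default, matching >= 0 and <= 2200.
in_range = df_merged_1["Revenue"].between(0, 2200)

# One label per row, chosen element-wise from the boolean mask.
df_merged_1["Revenue_Segment"] = np.where(in_range, "test1", "test0")

print(df_merged_1)
```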

KeyError: Not in index, using a keys generated from a Pandas dataframe on itself

I have two columns in a pandas DataFrame that has a datetime index. The two columns contain data measuring the same parameter, but neither column is complete (some rows have no data at all, some rows have data in both columns, and others have data only in column 'a' or 'b').
I've written the following code to find gaps in the columns, generate a list of the index dates where these gaps appear, and use this list to find and replace the missing data. However, I get a KeyError: Not in index on line 3, which I don't understand because the keys I'm using to index came from the DataFrame itself. Could somebody explain why this is happening and what I can do to fix it? Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df
merge_func(sve)
Whenever you are assigning values, you should use .loc:
df.loc[null_index,'TOC_mg/L']=df['DOC_mg/L']
The error in your original code comes from the order of the subscripts on the right-hand side: df[null_index] tries to look up the row labels as column names, which fails. This line:
df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
raises an error. On a toy dataset I get: IndexError: indices are out-of-bounds
If you change the order to this, it would probably work:
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index]
However, this is chained assignment and should be avoided; see the pandas documentation on returning a view versus a copy.
So you should use loc:
df.loc[null_index,'TOC_mg/L']=df['DOC_mg/L']
df.loc[notnull_index, 'DOC_mg/L'] = df['TOC_mg/L']
Note that the right-hand side does not need to be restricted to the same index, because pandas aligns it on the index during assignment.
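A self-contained sketch of the loc approach on a toy frame is below. The dates and values are invented for illustration; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Toy data: each column has gaps, and the last row is empty in both.
idx = pd.date_range("2004-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {"DOC_mg/L": [1.0, np.nan, 3.0, np.nan],
     "TOC_mg/L": [np.nan, 2.0, 5.0, np.nan]},
    index=idx,
)

# Row labels where exactly one of the two columns has a value.
null_index = df[df["DOC_mg/L"].notnull() & df["TOC_mg/L"].isnull()].index
notnull_index = df[df["DOC_mg/L"].isnull() & df["TOC_mg/L"].notnull()].index

# .loc selects rows and column in one step; the right-hand Series is
# aligned on the index, so only the selected rows are filled.
df.loc[null_index, "TOC_mg/L"] = df["DOC_mg/L"]
df.loc[notnull_index, "DOC_mg/L"] = df["TOC_mg/L"]

print(df)
```

Rows that are empty in both columns are left as NaN, since neither index selects them.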

Combining data from two dataframe columns into one column

I have time series data in two separate DataFrame columns which refer to the same parameter but are of differing lengths.
On dates where data only exist in one column, I'd like this value to be placed in my new column. On dates where there are entries for both columns, I'd like to have the mean value. (I'd like to join using the index, which is a datetime value)
Could somebody suggest a way that I could combine my two columns? Thanks.
Edit2: I've written some code which should merge the data from both of my columns, but I get a KeyError when I try to set the new values using the index generated from rows where my first column has values but my second column doesn't. Here's the code:
def merge_func(df):
    null_index = df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
    df['TOC_mg/L'][null_index] = df[null_index]['DOC_mg/L']
    notnull_index = df[(df['DOC_mg/L'].isnull() == True) & (df['TOC_mg/L'].isnull() == False)].index
    df['DOC_mg/L'][notnull_index] = df[notnull_index]['TOC_mg/L']
    df.insert(len(df.columns), 'Mean_mg/L', 0.0)
    df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
    return df
merge_func(sve)
And here's the error:
KeyError: "['2004-01-14T01:00:00.000000000+0100' '2004-03-04T01:00:00.000000000+0100'\n '2004-03-30T02:00:00.000000000+0200' '2004-04-12T02:00:00.000000000+0200'\n '2004-04-15T02:00:00.000000000+0200' '2004-04-17T02:00:00.000000000+0200'\n '2004-04-19T02:00:00.000000000+0200' '2004-04-20T02:00:00.000000000+0200'\n '2004-04-22T02:00:00.000000000+0200' '2004-04-26T02:00:00.000000000+0200'\n '2004-04-28T02:00:00.000000000+0200' '2004-04-30T02:00:00.000000000+0200'\n '2004-05-05T02:00:00.000000000+0200' '2004-05-07T02:00:00.000000000+0200'\n '2004-05-10T02:00:00.000000000+0200' '2004-05-13T02:00:00.000000000+0200'\n '2004-05-17T02:00:00.000000000+0200' '2004-05-20T02:00:00.000000000+0200'\n '2004-05-24T02:00:00.000000000+0200' '2004-05-28T02:00:00.000000000+0200'\n '2004-06-04T02:00:00.000000000+0200' '2004-06-10T02:00:00.000000000+0200'\n '2004-08-27T02:00:00.000000000+0200' '2004-10-06T02:00:00.000000000+0200'\n '2004-11-02T01:00:00.000000000+0100' '2004-12-08T01:00:00.000000000+0100'\n '2011-02-21T01:00:00.000000000+0100' '2011-03-21T01:00:00.000000000+0100'\n '2011-04-04T02:00:00.000000000+0200' '2011-04-11T02:00:00.000000000+0200'\n '2011-04-14T02:00:00.000000000+0200' '2011-04-18T02:00:00.000000000+0200'\n '2011-04-21T02:00:00.000000000+0200' '2011-04-25T02:00:00.000000000+0200'\n '2011-05-02T02:00:00.000000000+0200' '2011-05-09T02:00:00.000000000+0200'\n '2011-05-23T02:00:00.000000000+0200' '2011-06-07T02:00:00.000000000+0200'\n '2011-06-21T02:00:00.000000000+0200' '2011-07-04T02:00:00.000000000+0200'\n '2011-07-18T02:00:00.000000000+0200' '2011-08-31T02:00:00.000000000+0200'\n '2011-09-13T02:00:00.000000000+0200' '2011-09-28T02:00:00.000000000+0200'\n '2011-10-10T02:00:00.000000000+0200' '2011-10-25T02:00:00.000000000+0200'\n '2011-11-08T01:00:00.000000000+0100' '2011-11-28T01:00:00.000000000+0100'\n '2011-12-20T01:00:00.000000000+0100' '2012-01-19T01:00:00.000000000+0100'\n '2012-02-14T01:00:00.000000000+0100' '2012-03-13T01:00:00.000000000+0100'\n 
'2012-03-27T02:00:00.000000000+0200' '2012-04-02T02:00:00.000000000+0200'\n '2012-04-10T02:00:00.000000000+0200' '2012-04-17T02:00:00.000000000+0200'\n '2012-04-26T02:00:00.000000000+0200' '2012-04-30T02:00:00.000000000+0200'\n '2012-05-03T02:00:00.000000000+0200' '2012-05-07T02:00:00.000000000+0200'\n '2012-05-10T02:00:00.000000000+0200' '2012-05-14T02:00:00.000000000+0200'\n '2012-05-22T02:00:00.000000000+0200' '2012-06-05T02:00:00.000000000+0200'\n '2012-06-19T02:00:00.000000000+0200' '2012-07-03T02:00:00.000000000+0200'\n '2012-07-17T02:00:00.000000000+0200' '2012-07-31T02:00:00.000000000+0200'\n '2012-08-14T02:00:00.000000000+0200' '2012-08-28T02:00:00.000000000+0200'\n '2012-09-11T02:00:00.000000000+0200' '2012-09-25T02:00:00.000000000+0200'\n '2012-10-10T02:00:00.000000000+0200' '2012-10-24T02:00:00.000000000+0200'\n '2012-11-21T01:00:00.000000000+0100' '2012-12-18T01:00:00.000000000+0100'] not in index"
You are close, but you don't need to iterate over the rows when using isnull(). By default,
df[(df['DOC_mg/L'].isnull() == False) & (df['TOC_mg/L'].isnull() == True)].index
will return just the index of the rows where DOC_mg/L is not null and TOC_mg/L is null.
Now you can do something like this to set the values for TOC_mg/L:
null_index = df[(df['DOC_mg/L'].isnull() == False) & \
(df['TOC_mg/L'].isnull() == True)].index
df['TOC_mg/L'][null_index] = df['DOC_mg/L'][null_index] # EDIT To switch the index position.
This uses the index of the rows where TOC_mg/L is null and DOC_mg/L is not null, and sets the values of TOC_mg/L to those found in DOC_mg/L in the same rows.
Note: This is not the accepted way of setting values using an index, but it is how I've been doing it for some time. Just make sure that when setting values, the left side of the assignment is df['col_name'][index]. If col_name and index are switched, you will set the values on a copy that is never written back to the original. The documented alternative is df.loc[index, 'col_name'].
Now, to set the mean, you can create a new column, which we'll call Mean_mg/L, with the value 0.0, and then set this new column to the mean of both columns:
# Insert a new col at the end of the dataframe columns name 'Mean_mg/L'
# with default value 0.0
df.insert(len(df.columns), 'Mean_mg/L', 0.0)
# Set this columns value to the average of DOC_mg/L and TOC_mg/L
df['Mean_mg/L'] = (df['DOC_mg/L'] + df['TOC_mg/L']) / 2
In the rows where we filled a null value from the other column, the average is simply equal to that value.
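A shorter route to the same Mean_mg/L column, offered as a sketch rather than as the method used above, is a row-wise mean. DataFrame.mean skips NaN by default, so the gaps do not have to be filled first. The sample values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in each column; the last row is empty in both.
idx = pd.date_range("2004-01-01", periods=4, freq="D")
df = pd.DataFrame(
    {"DOC_mg/L": [1.0, np.nan, 3.0, np.nan],
     "TOC_mg/L": [np.nan, 2.0, 5.0, np.nan]},
    index=idx,
)

# Row-wise mean; NaN values are ignored, so a row with a single value
# returns that value and a row with no values stays NaN.
df["Mean_mg/L"] = df[["DOC_mg/L", "TOC_mg/L"]].mean(axis=1)

print(df)
```

Unlike the fill-then-average approach, this leaves the two original columns unchanged.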