Identifying which rows are present in another dataframe - python-2.7

I have two dataframes df1 and df2, which I'm told share some rows. That is, for some pairs of indices (i, j), df1.loc[i] == df2.loc[j] exactly. I would like to find this correspondence.
This has been a tricky problem to track down. I don't want to "manually" inquire about each of the columns for each of the rows, so I've been searching for something cleaner.
This is the best I have but it's not fast. I'm hoping some guru can point me in the right direction.
matching_idx = []
for ix in df1.index:
    match = df1.loc[ix:ix].to_dict(orient='list')
    matching_idx.append(df2.isin(match).all(axis=1))
It would be nice to get rid of the for loop but I'm not sure it's possible.

Assuming the rows in each dataframe are unique, you can concatenate the two dataframes and search for duplicates.
df1 = pd.DataFrame({'A': ['a', 'b'], 'B': ['a', 'c']})
df2 = pd.DataFrame({'A': ['c', 'a'], 'B': ['c', 'a']})

>>> df1
   A  B
0  a  a
1  b  c
>>> df2
   A  B
0  c  c
1  a  a

df = pd.concat([df1, df2])

# Returns the index values of duplicates in `df2`.
>>> df[df.duplicated()]
   A  B
1  a  a

# Returns the index value of duplicates in `df1`.
>>> df[df.duplicated(keep='last')]
   A  B
0  a  a
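And if you want the shared rows from both frames in one shot, a small extension of the same idea: duplicated also accepts keep=False, which flags every occurrence of a duplicated row.
>>> df[df.duplicated(keep=False)]
   A  B
0  a  a
1  a  a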

You can do a merge that joins on all columns:
match = df1.merge(df2, on=list(df1.columns))
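If you also need the original (i, j) index pairs rather than just the matching rows, one sketch is to carry each frame's index along as a column before merging. The column names i and j below are just illustrative, and this assumes neither frame already has a column named 'index':
left = df1.reset_index().rename(columns={'index': 'i'})
right = df2.reset_index().rename(columns={'index': 'j'})
pairs = left.merge(right, on=list(df1.columns))[['i', 'j']]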

Related

From a merged dataframe, get output of both dataframes side by side with differences

I have 2 dataframes (df1, df2), where the primary-key column can be at any index. Once I have the smaller dataframes (tempDf1, tempDf2) from both, with the primary key in a column, I need to check the other column values for a match. If there is a mismatch, I need to show the data side by side, and if data in df1 is missing from df2, the row should show dataDf1|NaN.
I tried to merge using an outer join and get the data that is missing from df1 or df2. The result gives me right_only or left_only rows. Now I need to get the result such that I have the side-by-side differences.
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([
        ['a', 5, 19],
        ['a', 51, 191],
        ['b', 14, 16],
        ['c', 24, 9],
        ['a', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
        ['b', 14, 16],
        ['a', 51, 191],
        ['c', 4, 9],
        ['a', 5, 19]]),
    columns=['name', 'attr11', 'attr12'])

merged = df1.merge(df2, indicator=True, how='outer')
merged[merged['_merge'] == 'left_only']
merged = merged.loc[merged['_merge'].str.contains("_only")]
print merged
Actual results:
name attr11 attr12      _merge
   c     24      9   left_only
   c      4      9  right_only
   a     24      9   left_only
Expected result -> prints only the differences:
name attr11_df1 attr11_df2 attr12_df1 attr12_df2
   c         24          4          9          9
   a         24        NaN          9        NaN
And highlight the differences in color
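One possible sketch for the side-by-side view, assuming the name values are unique within the left_only and right_only subsets: split the indicator result into its two halves and rejoin them on name, so each mismatch lands on one row.
# Rows present only in df1 and only in df2, respectively.
left = merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1)
right = merged[merged['_merge'] == 'right_only'].drop('_merge', axis=1)
# Outer join on the key puts the df1 and df2 values side by side;
# rows missing from df2 come through as NaN.
diff = left.merge(right, on='name', how='outer', suffixes=('_df1', '_df2'))
print diff
For the coloring, pandas' Styler (diff.style.apply(...)) can highlight cells, though it only renders in HTML contexts such as a Jupyter notebook.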

Average of median of a column in a list of dataframes

I am looking for the best way to take the average of the medians of a column (same column name) across a list of data frames.
Let's say I have a list of dataframes, list_df. I can write the following for loop to get the required output, but I am more interested in seeing whether the for loop can be eliminated.
med_arr = []
list_df = [df1, df2, df3]
for df in list_df:
    med_arr.append(np.median(df['col_name']))
np.mean(med_arr)
Consider the sample data
np.random.seed([3,1415])
df1 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
df2 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
df3 = pd.DataFrame(dict(col_name=np.random.randint(10, size=10)))
list_df = [df1, df2, df3]
Option 1: pandas
pd.concat([d['col_name'] for d in list_df], axis=1).median().mean()
3.8333333333333335
Option 2: numpy
np.median([d['col_name'].values for d in list_df], 1).mean()
3.8333333333333335
This could be done as a list comprehension:
list_df = [df1, df2, df3]
med_arr = [np.median(df['col_name']) for df in list_df]
np.mean(med_arr)

PYTHON 2.7 - Modifying List of Lists and Re-Assembling Without Mutating

I currently have a list of lists that looks like this:
My_List = [['This', 'Is', 'A', 'Sample', 'Text', 'Sentence'], ['This', 'too', 'is', 'a', 'sample', 'text'], ['finally', 'so', 'is', 'this', 'one']]
Now what I need to do is "tag" each of these words with one of 3 (in this case arbitrary) tags, such as "EE", "FF", or "GG", based on which list the word is in, and then reassemble them in the same order they came in. My final output would need to look like:
GG_List = ['This', 'Sentence']
FF_List = ['Is', 'A', 'Text']
EE_List = ['Sample']
My_List = [[('This', 'GG'), ('Is', 'FF'), ('A', 'FF'), ('Sample', 'EE'), ('Text', 'FF'), ('Sentence', 'GG')], [*same with this sentence*], [*and this one*]]
I tried this by using for loops to turn each item into a dict, but the dicts then got rearranged by their tags, which sadly can't happen because of the nature of this experiment: everything needs to stay in the same order, because eventually I need to measure the proximity of tags relative to others, but only within the same sentence (list).
I thought about doing this with NLTK (which I have little experience with), but it looks like it is much more sophisticated than what I need, and the tags aren't easily customized by a novice like myself.
I think this could be done by iterating through each of these items, using an if statement as I have to determine what tag they should have, and then making a tuple out of the word and its associated tag so it doesn't shift around within its list.
I've devised this, but I can't figure out how to rebuild my list of lists and keep them in order :(.
for i in My_List:  # For each list in the list of lists
    for h in i:  # For each item in each list
        if h in GG_List:  # Check for the tag
            MyDicts = {"GG": h for h in i}  # Make dict from tag + word
Thank you so much for your help!
Putting the tags in a dictionary would work:
My_List = [['This', 'Is', 'A', 'Sample', 'Text', 'Sentence'],
           ['This', 'too', 'is', 'a', 'sample', 'text'],
           ['finally', 'so', 'is', 'this', 'one']]
GG_List = ['This', 'Sentence']
FF_List = ['Is', 'A', 'Text']
EE_List = ['Sample']
zipped = zip((GG_List, FF_List, EE_List), ('GG', 'FF', 'EE'))
tags = {item: tag for tag_list, tag in zipped for item in tag_list}
res = [[(word, tags[word]) for word in entry if word in tags] for entry in My_List]
Now:
>>> res
[[('This', 'GG'),
  ('Is', 'FF'),
  ('A', 'FF'),
  ('Sample', 'EE'),
  ('Text', 'FF'),
  ('Sentence', 'GG')],
 [('This', 'GG')],
 []]
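Note that the comprehension above drops words that have no tag, which is why the second list shrinks and the third is empty. If you would rather keep every word and mark untagged ones with None, a minimal variant uses dict.get:
res = [[(word, tags.get(word)) for word in entry] for entry in My_List]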
A dictionary works by key-value pairs: each key is assigned a value, and you look the value up by its key, e.g.
>>> d = {1:'a', 2:'b', 3:'c'}
>>> d[1]
'a'
In the above case, we always search the dictionary by its keys, i.e. the integers.
In the case where you want to assign a tag/label to each word, the word is the key you search by and the tag/label is the value you find, so your dictionary would have to look something like this (assuming the strings are words and the numbers are tags/labels):
>>> d = {'a':1, 'b':1, 'c':3}
>>> d['a']
1
>>> sent = 'a b c a b'.split()
>>> sent
['a', 'b', 'c', 'a', 'b']
>>> [d[word] for word in sent]
[1, 1, 3, 1, 1]
This way the order of the tags follows the order of the words when you use a list comprehension to iterate through the words and find the appropriate tags.
So the problem comes when your initial dictionary is indexed the wrong way around, i.e. key -> label, value -> words, e.g.:
>>> d = {1:['a', 'd'], 2:['b', 'h'], 3:['c', 'x']}
>>> [d[word] for word in sent]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'a'
Then you would have to invert your dictionary. Assuming that all elements in your value lists are unique, you can do this:
>>> from collections import ChainMap
>>> d = {1:['a', 'd'], 2:['b', 'h'], 3:['c', 'x']}
>>> d_inv = dict(ChainMap(*[{value:key for value in values} for key, values in d.items()]))
>>> d_inv
{'h': 2, 'c': 3, 'a': 1, 'x': 3, 'b': 2, 'd': 1}
But the caveat is that ChainMap is only available in Python 3.3+ (yet another reason to upgrade your Python ;P). For older versions, see How do I merge a list of dicts into a single dict?.
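For reference, a minimal sketch of the same inversion that works on Python 2.7, using a plain dict comprehension (again assuming the value lists don't share elements):
>>> d = {1: ['a', 'd'], 2: ['b', 'h'], 3: ['c', 'x']}
>>> d_inv = {value: key for key, values in d.items() for value in values}
>>> d_inv
{'h': 2, 'c': 3, 'a': 1, 'x': 3, 'b': 2, 'd': 1}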
So going back to the problem of assigning labels/tags to words, let's say we have these input:
>>> d = {1:['a', 'd'], 2:['b', 'h'], 3:['c', 'x']}
>>> sent = 'a b c a b'.split()
First, we invert the dictionary (assuming a one-to-one mapping between every word and its tag/label):
>>> d_inv = dict(ChainMap(*[{value:key for value in values} for key, values in d.items()]))
Then, we apply the tags to the words through a list comprehension:
>>> [d_inv[word] for word in sent]
[1, 2, 3, 1, 2]
And for multiple sentences:
>>> sentences = ['a b c'.split(), 'h a x'.split()]
>>> [[d_inv[word] for word in sent] for sent in sentences]
[[1, 2, 3], [2, 1, 3]]

Dictionary Key Error

I am trying to construct a dictionary with values from a csv file. Say there are 10 columns and I want to set the first column as the key and the remaining columns as the values.
When I build it in a for loop, each key ends up with only one value. Kindly suggest a way.
import csv
import numpy

aname = {}
# loading the file in numpy
result = numpy.array(list(csv.reader(open('somefile', "rb"), delimiter=','))).astype('string')
# develop a dict
r = {aname[rows[0]]: rows[1:] for rows in result}
print r[0]
Error as follows.
r = {aname[rows[0]]: rows[1:] for rows in result}
KeyError: '2a9ac84c-3315-5576-4dfd-8bc34072360d|11937055'
I'm not entirely sure what you mean to do here, but does this help:
>>> result = [[1, 'a', 'b'], [2, 'c', 'd']]
>>> dict([(row[0], row[1:]) for row in result])
{1: ['a', 'b'], 2: ['c', 'd']}
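For what it's worth, the KeyError in the original code comes from aname[rows[0]]: aname is an empty dict, and putting aname[rows[0]] on the key side of the comprehension performs a lookup, not an assignment. Dropping aname gives the intended dict:
# The first field becomes the key; the rest of the row becomes the value.
r = {rows[0]: rows[1:] for rows in result}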

Column recoding based on count of distincts

I have a pandas data frame like this:
import pandas as pd
data = {'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
        'VAR2': ['C', 'V', 'C', 'C', 'V', 'D']}
frame = pd.DataFrame(data)
Fundamentally I need to recode each variable. The recoding would work like this: count how many times each value occurs in its column, and if the count is greater than or equal to a threshold, keep the original value, otherwise set a new value of 'X'. If the threshold were 3, then this is what it would need to look like.
data2 = {'VAR3': ['A', 'A', 'A', 'A', 'X', 'X'],
         'VAR4': ['C', 'X', 'C', 'C', 'X', 'X']}
frame2 = pd.DataFrame(data2)
And this is the desired output, with the original data merged to the recoded data.
pd.merge(frame, frame2, left_index=True, right_index=True)
I'm new to Python and while the book Python for Data Analysis is really helping me, I still cannot quite figure out how to achieve the desired result in a simple way.
Any help would be appreciated!
Take each column individually. Group it by value, and use the filter method on the groups to replace any group with fewer than 3 members with NaN. Then replace those NaNs with X.
You could do this all in one list comprehension, but for clarity I defined a recode function that does all the substantial stuff.
In [38]: def recode(s, threshold):
   ....:     return s.groupby(s).filter(lambda x: x.count() >= threshold, dropna=False).fillna(value='X')
   ....:
Applying to each column and then reassembling the columns into one new DataFrame....
In [39]: frame2 = pd.concat([recode(frame[col], 3) for col in frame], axis=1)
In [40]: frame2
Out[40]:
  VAR1 VAR2
0    A    C
1    A    X
2    A    C
3    A    C
4    X    X
5    X    X
And, to be sure, you can merge the original and the recoded frames just as you expressed it in your question:
In [27]: pd.merge(frame, frame2, left_index=True, right_index=True)
Out[27]:
  VAR1_x VAR2_x VAR1_y VAR2_y
0      A      C      A      C
1      A      V      A      X
2      A      C      A      C
3      A      C      A      C
4      B      V      X      X
5      B      D      X      X
Edit: Use this equivalent workaround for pandas versions < 0.12:
def recode(s, threshold):
    # True where the value's group meets the threshold, False otherwise
    b = s.groupby(s).transform(lambda x: x.count() >= threshold).astype('bool')
    s[~b] = 'X'
    return s
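One caveat with this workaround: s[~b] = 'X' modifies the Series in place, and frame[col] hands back a reference into the original frame. If you want frame left untouched, pass a copy (a small usage sketch):
frame2 = pd.concat([recode(frame[col].copy(), 3) for col in frame], axis=1)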