Column recoding based on count of distincts - python-2.7

I have a pandas DataFrame like this:
import pandas as pd
data = {'VAR1': ['A', 'A', 'A', 'A', 'B', 'B'],
        'VAR2': ['C', 'V', 'C', 'C', 'V', 'D']}
frame = pd.DataFrame(data)
Fundamentally I need to recode each variable. The recoding works like this: count the occurrences of each value within a column; if a value's count is greater than or equal to a threshold, keep the original value, otherwise replace it with 'X'. If the threshold were 3, this is what the result would need to look like.
data2 = {'VAR3': ['A', 'A', 'A', 'A', 'X', 'X'],
         'VAR4': ['C', 'X', 'C', 'C', 'X', 'X']}
frame2 = pd.DataFrame(data2)
And this is the desired output, with the original data merged with the recoded data:
pd.merge(frame, frame2, left_index=True, right_index=True)
I'm new to Python and while the book Python for Data Analysis is really helping me, I still cannot quite figure out how to achieve the desired result in a simple way.
Any help would be appreciated!

Take each column individually. Group it by its values, and use the filter method on the groups to replace any group with fewer than 3 members with NaN. Then replace those NaNs with 'X'.
You could do this all in one list comprehension, but for clarity I defined a recode function that does all the substantial stuff.
In [38]: def recode(s, threshold):
....: return s.groupby(s).filter(lambda x: x.count() >= threshold, dropna=False).fillna(value='X')
....:
Applying to each column and then reassembling the columns into one new DataFrame....
In [39]: frame2 = pd.concat([recode(frame[col], 3) for col in frame], axis=1)
In [40]: frame2
Out[40]:
VAR1 VAR2
0 A C
1 A X
2 A C
3 A C
4 X X
5 X X
And, to be sure, you can merge the original and the recoded frames just as you expressed it in your question:
In [27]: pd.merge(frame, frame2, left_index=True, right_index=True)
Out[27]:
VAR1_x VAR2_x VAR1_y VAR2_y
0 A C A C
1 A V A X
2 A C A C
3 A C A C
4 B V X X
5 B D X X
Edit: Use this equivalent workaround for pandas versions < 0.12:
def recode(s, threshold):
    s = s.copy()  # work on a copy so the original column is left untouched
    b = s.groupby(s).transform(lambda x: x.count() >= threshold).astype('bool')  # True/False
    s[~b] = 'X'
    return s
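For comparison, here is a minimal alternative sketch (not from the answer above) that uses value_counts instead of groupby; recode_vc is a hypothetical name, and frame and the threshold of 3 come from the question:
def recode_vc(s, threshold):
    # count how often each value occurs in the column
    counts = s.value_counts()
    # keep values that occur at least `threshold` times, replace the rest with 'X'
    keep = s.isin(counts[counts >= threshold].index)
    return s.where(keep, other='X')

frame2 = pd.concat([recode_vc(frame[col], 3) for col in frame], axis=1)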

Related

Selecting only one row at a time for iteration in PANDAS-PYTHON

I have the following code and a text file with 5 (X and Y) values, shown below. I need to iterate 1000 times for every X and Y value. How can I achieve this?
import pandas as pd
data = pd.read_csv("test.txt", delim_whitespace=True, skipinitialspace=True)
# for every line in the text document:
for i in range(1, 1001, 1):
    z = data["X"] + data["Y"]
    z = z + 10
print z
The text file is like
X Y
1 10
2 20
3 30
4 40
5 50
The output must be:
10011
10022
10033
10044
10055
You can select one row at a time using .loc. Please read this documentation to fully understand how it works. Here is your data:
import pandas as pd
df = pd.DataFrame({'X':['1','2','3','4','5'], 'Y': ['10','20','30','40','50']})
This code
print df.loc[0]
will give you the first row (with index=0) as a pandas series (pd.Series), which is essentially like a dataframe with one column only: a vector.
X 1
Y 10
Name: 0, dtype: object
If you want the second row then: df.loc[1] and so on...
If you want to iterate one row at the time, you can select each row in the first for loop and perform your operations 1000 times in the second for loop:
for ix in df.index:  # df.index gives [0, 1, 2, 3, 4]
    for i in xrange(0, 1000):
        ser = df.loc[ix]
        print ser['X'] + ser['Y'] + '10'
Try this:
data = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [10, 20, 30, 40, 50]})
for each_line in data.index:
    z = data['X'].loc[each_line] + data['Y'].loc[each_line]
    for i in range(1, 1001, 1):
        z += 10
    print(z)
Output
10011
10022
10033
10044
10055
If you want to add a new column to the DataFrame:
data["total"] = sorted(set([(data.loc[ix]['X'].astype(int) + data.loc[ix]['Y'].astype(int)).astype(str) + "10" for ix in data.index for i in range(1, 1001)]))
If you want to concatenate 'X', 'Y' and '10' as strings:
[data.loc[ix]['X'].astype(str) + data.loc[ix]['Y'].astype(str) + "10" for ix in data.index for i in range(1, 1001)]
And if you want the sum of 'X' + 'Y' concatenated with '10':
final_data = [(data.loc[ix]['X'].astype(int) + data.loc[ix]['Y'].astype(int)).astype(str) + "10" for ix in data.index for i in range(1, 1001)]
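If the goal is just the five numbers shown above, the loops can be avoided entirely. A vectorized sketch (using the integer frame from the "Try this" answer), based on the observation that adding 10 a thousand times is the same as adding 10000 once:
z = data['X'] + data['Y'] + 10 * 1000   # whole-column computation, no loop
print(z.tolist())                        # [10011, 10022, 10033, 10044, 10055]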

How to create a column in pandas dataframe using conditions defined in dict

Here's my code:
import pandas as pd
import numpy as np
input = {'name': ['Andy', 'Alex', 'Amy', 'Olivia'],
         'rating': ['A', 'A', 'B', 'B'],
         'score': [100, 60, 70, 95]}
df = pd.DataFrame(input)
df['valid1']=np.where((df['score']==100) & (df['rating']=='A'),'true','false')
The code above works fine: it sets the new column 'valid1' to 'true' where score is 100 and rating is 'A'.
If the condition comes from a dict variable as
c = {'score':'100', 'rating':'A'}
How can I use the condition defined in c to get the same 'valid' column? I tried the following code:
for key, value in c.iteritems():
    df['valid2'] = np.where((df[key] == value), 'true', 'false')
and got an error:
TypeError: Invalid type comparison
I'd define c as a pd.Series so that when you compare it to a dataframe, it automatically compares against each row while matching columns with the series indices. Note that I made sure 100 was an integer and not a string.
c = pd.Series({'score':100, 'rating':'A'})
i = df.columns.intersection(c.index)
df.assign(valid1=df[i].eq(c).all(1))
name rating score valid1
0 Andy A 100 True
1 Alex A 60 False
2 Amy B 70 False
3 Olivia B 95 False
You can use the same series and still use numpy to speed things up:
c = pd.Series({'score': 100, 'rating': 'A'})
i = df.columns.intersection(c.index)
v = np.column_stack([df[col].values for col in i])
df.assign(valid1=(v == c.loc[i].values).all(1))
name rating score valid1
0 Andy A 100 True
1 Alex A 60 False
2 Amy B 70 False
3 Olivia B 95 False
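For completeness, here is a sketch (not from the answers above) that works from the dict directly, without converting it to a Series: build one boolean mask per key/value pair and combine them with a logical AND. Note that score is kept as an integer here; comparing the integer column with the string '100' is what raised the "Invalid type comparison" error.
c = {'score': 100, 'rating': 'A'}
# one boolean Series per condition, ANDed together across all keys
mask = np.logical_and.reduce([df[k] == v for k, v in c.items()])
df['valid2'] = np.where(mask, 'true', 'false')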

identifying which rows are present in another dataframe

I have two dataframes df1 and df2, which I'm told share some rows. That is, for some pairs of indices (i, j), df1.loc[i] == df2.loc[j] holds exactly. I would like to find this correspondence.
This has been a tricky problem to track down. I don't want to "manually" inquire about each of the columns for each of the rows, so I've been searching for something cleaner.
This is the best I have but it's not fast. I'm hoping some guru can point me in the right direction.
matching_idx = []
for ix in df1.index:
    match = df1.loc[ix:ix].to_dict(orient='list')
    matching_idx.append(df2.isin(match).all(axis=1))
It would be nice to get rid of the for loop but I'm not sure it's possible.
Assuming the rows in each dataframe are unique, you can concatenate the two dataframes and search for duplicates.
df1 = pd.DataFrame({'A': ['a', 'b'], 'B': ['a', 'c']})
df2 = pd.DataFrame({'A': ['c', 'a'], 'B': ['c', 'a']})
>>> df1
A B
0 a a
1 b c
>>> df2
A B
0 c c
1 a a
df = pd.concat([df1, df2])
# Returns the index values of duplicates in `df2`.
>>> df[df.duplicated()]
A B
1 a a
# Returns the index value of duplicates in `df1`.
>>> df[df.duplicated(keep='last')]
A B
0 a a
You can do a merge that joins on all columns:
match = df1.merge(df2, on=list(df1.columns))
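If you also need the (i, j) index pairs themselves rather than just the matching rows, one sketch is to expose each index as a column before merging; the 'index_x' and 'index_y' names below come from pandas' default merge suffixes:
pairs = df1.reset_index().merge(df2.reset_index(), on=list(df1.columns))
print(pairs[['index_x', 'index_y']])   # one row per matching (i, j) pair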

Assigning new column name and creating new column conditionally in pandas not working?

I have a simple pandas dataframe, and then I rename the columns to 'a' and 'b'.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df.columns = ['a', 'b']
print df
df['color'] = np.where(df['b']=='Z', 'green', 'red')
print df
a b
0 Z A
1 Z B
2 X B
3 Y C
a b color
0 Z A red
1 Z B red
2 X B red
3 Y C red
Without the renaming line df.columns, I get
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
#df.columns = ['a', 'b']
#print df
df['color'] = np.where(df['Set']=='Z', 'green', 'red')
print df
Set Type color
0 Z A green
1 Z B green
2 X B red
3 Y C red
I would expect the first version of the code to also produce "green green red red", but it doesn't and I don't know why.
As pointed out in the comments, the problem comes from how you rename the columns: assigning a list to df.columns only relabels them positionally, and since the dict keys come out in alphabetical order ('Set', 'Type'), the label 'b' lands on the Type column, which never equals 'Z'. You are better off renaming explicitly, like so:
df = df.rename( columns={'Set': 'a','Type': 'b'})
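For completeness, a short sketch of the corrected flow; after this rename the old Set column is called 'a', so that is the column the condition should check:
df = pd.DataFrame({'Type': list('ABBC'), 'Set': list('ZZXY')})
df = df.rename(columns={'Set': 'a', 'Type': 'b'})
df['color'] = np.where(df['a'] == 'Z', 'green', 'red')   # green, green, red, red
print df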

All triplet combinations, 6 values at a time

I am looking for an algorithm that efficiently generates a small set of 6-tuples which cumulatively cover all possible 3-tuple combinations of a dataset.
For instance, computing playing-card hands of 6 cards that express all possible 3 card combinations.
For example, given a dataset:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
The first "pick" of 6 values might be:
['a','b','c','d','e','f']
And this covers the three-value combinations:
('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'b', 'e'), ('a', 'b', 'f'), ('a', 'c', 'd'),
('a', 'c', 'e'), ('a', 'c', 'f'), ('a', 'd', 'e'), ('a', 'd', 'f'), ('a', 'e', 'f'),
('b', 'c', 'd'), ('b', 'c', 'e'), ('b', 'c', 'f'), ('b', 'd', 'e'), ('b', 'd', 'f'),
('b', 'e', 'f'), ('c', 'd', 'e'), ('c', 'd', 'f'), ('c', 'e', 'f'), ('d', 'e', 'f')
It is obviously possible by:
computing all triplet combinations
picking 6 values
computing all triplet combinations for those 6 values
removing these combinations from the first computation
repeating until all have been accounted for
In this example there are (26*25*24)/(3*2*1) == 2600 possible triplet combinations, and using the "brute-force" method above, all of them can be covered by around 301 6-value groups.
However, it feels like there ought to be a more efficient way of achieving this.
My preferred language is python, but I'm planning on implementing this in C++.
Update
Here's my python code to "brute-force" it:
from itertools import combinations

data_set = list('abcdefghijklmnopqrstuvwxyz')

def calculate(data_set):
    # every 3-combination, as frozensets so covered triplets can be removed as a set
    all_triplets = list(frozenset(x) for x in combinations(data_set, 3))
    data = set(all_triplets)
    sextuples = []
    while data:
        # greedily grow a 6-value group from the still-uncovered triplets
        sxt = set()
        for item in data:
            nxt = sxt | item
            if len(nxt) > 6:
                continue
            sxt = nxt
            if len(nxt) == 6:
                break
        sextuples.append(list(sxt))
        covers = set(frozenset(x) for x in combinations(list(sxt), 3))
        data = data - covers
        print "%r\t%s" % (list(sxt), len(data))
    print "Completed %s triplets in %s sextuples" % (len(all_triplets), len(sextuples))

calculate(data_set)
Completed 2600 triplets in 301 sextuples
I'm looking for something more computationally efficient than this.
Update
Senderle has provided an interesting solution: divide the dataset into pairs, then generate all possible triplets of those pairs. This is definitely better than anything I'd come up with.
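For intuition, here is a minimal sketch of that pairing idea for the even-length case (odd-length inputs need the fill-value handling in Senderle's full generator further down):
from itertools import combinations

letters = list('abcdefghijklmnopqrstuvwxyz')
# split the 26 letters into 13 consecutive pairs...
pairs = [tuple(letters[i:i + 2]) for i in range(0, len(letters), 2)]
# ...and take every combination of three pairs: C(13, 3) = 286 sextuplets
sextuplets = [a + b + c for a, b, c in combinations(pairs, 3)]
print len(sextuplets)   # 286, versus 301 from the brute-force approach above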
Here's a quick function to check whether all triplets are covered and assess the redundancy of triplet coverage:
from itertools import combinations

def check_coverage(data_set, sextuplets):
    all_triplets = dict.fromkeys(combinations(data_set, 3), 0)
    sxt_count = 0
    for sxt in sextuplets:
        sxt_count += 1
        for triplet in combinations(sxt, 3):
            all_triplets[triplet] += 1
    total = len(all_triplets)
    biggest_overlap = overlap = nohits = onehits = morehits = 0
    for k, v in all_triplets.iteritems():
        if v == 0:
            nohits += 1
        elif v == 1:
            onehits += 1
        else:
            morehits += 1
            overlap += v - 1
            if v > biggest_overlap:
                biggest_overlap = v
    print "All Triplets in dataset: %6d" % (total,)
    print "Total triplets from sxt: %6d" % (total + overlap,)
    print "Number of sextuples: %6d\n" % (sxt_count,)
    print "Missed %6d of %6d: %6.1f%%" % (nohits, total, 100.0 * nohits / total)
    print "HitOnce %6d of %6d: %6.1f%%" % (onehits, total, 100.0 * onehits / total)
    print "HitMore %6d of %6d: %6.1f%%" % (morehits, total, 100.0 * morehits / total)
    print "Overlap %6d of %6d: %6.1f%%" % (overlap, total, 100.0 * overlap / total)
    print "Biggest Overlap: %3d" % (biggest_overlap,)
Using Senderle's sextuplets generator, I'm fascinated to observe that the repeated triplets are localised, and that as the datasets increase in size the repeats become proportionally more localised and the peak repeat larger.
>>> check_coverage(range(26),sextuplets(range(26)))
All Triplets in dataset: 2600
Total triplets from sxt: 5720
Number of sextuples: 286
Missed 0 of 2600: 0.0%
HitOnce 2288 of 2600: 88.0%
HitMore 312 of 2600: 12.0%
Overlap 3120 of 2600: 120.0%
Biggest Overlap: 11
>>> check_coverage(range(40),sextuplets(range(40)))
All Triplets in dataset: 9880
Total triplets from sxt: 22800
Number of sextuples: 1140
Missed 0 of 9880: 0.0%
HitOnce 9120 of 9880: 92.3%
HitMore 760 of 9880: 7.7%
Overlap 12920 of 9880: 130.8%
Biggest Overlap: 18
>>> check_coverage(range(80),sextuplets(range(80)))
All Triplets in dataset: 82160
Total triplets from sxt: 197600
Number of sextuples: 9880
Missed 0 of 82160: 0.0%
HitOnce 79040 of 82160: 96.2%
HitMore 3120 of 82160: 3.8%
Overlap 115440 of 82160: 140.5%
Biggest Overlap: 38
I believe the following produces correct results. It relies on the intuition that to cover all triplets, it is enough to split the items into pairs (any fixed pairing will do) and generate every possible combination of three pairs. This "mixes" values together well enough that all possible triplets are represented.
There's a slight wrinkle. For an odd number of items, one pair isn't a pair at all, so you can't generate a sextuplet from it, but the value still needs to be represented. The code does some gymnastics to sidestep that problem; there might be a better way, but I'm not sure what it is.
from itertools import izip_longest, islice, combinations

def sextuplets(seq, _fillvalue=object()):
    if len(seq) < 6:
        yield [tuple(seq)]
        return
    it = iter(seq)
    pairs = izip_longest(it, it, fillvalue=_fillvalue)
    sextuplets = (a + b + c for a, b, c in combinations(pairs, 3))
    for st in sextuplets:
        if st[-1] == _fillvalue:
            # replace fill value with valid item not in sextuplet
            # while maintaining original order
            for i, (x, y) in enumerate(zip(st, seq)):
                if x != y:
                    st = st[0:i] + (y,) + st[i:-1]
                    break
        yield st
I tested it on sequences of items of length 10 to 80, and it generates correct results in all cases. I don't have a proof that this will give correct results for all sequences though. I also don't have a proof that this is a minimal set of sextuplets. But I'd love to hear a proof of either, if anyone can come up with one.
>>> def gen_triplets_from_sextuplets(st):
... triplets = [combinations(s, 3) for s in st]
... return set(t for trip in triplets for t in trip)
...
>>> test_items = [xrange(n) for n in range(10, 80)]
>>> triplets = [set(combinations(i, 3)) for i in test_items]
>>> st_triplets = [gen_triplets_from_sextuplets(sextuplets(i))
...                for i in test_items]
>>> all(t == s for t, s in zip(triplets, st_triplets))
True
Although I already said so, I'll point out again that this is an inefficient way to actually generate triplets, as it produces duplicates.
>>> def gen_triplet_list_from_sextuplets(st):
... triplets = [combinations(s, 3) for s in st]
... return list(t for trip in triplets for t in trip)
...
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(10)))
>>> len(tlist)
200
>>> len(set(tlist))
120
>>> tlist = gen_triplet_list_from_sextuplets(sextuplets(range(80)))
>>> len(tlist)
197600
>>> len(set(tlist))
82160
Indeed, although theoretically you should get a speedup...
>>> len(list(sextuplets(range(80))))
9880
... itertools.combinations still outperforms sextuplets for small sequences:
>>> %timeit list(sextuplets(range(20)))
10000 loops, best of 3: 68.4 us per loop
>>> %timeit list(combinations(range(20), 3))
10000 loops, best of 3: 55.1 us per loop
And itertools.combinations remains competitive with sextuplets for medium-sized sequences:
>>> %timeit list(sextuplets(range(200)))
10 loops, best of 3: 96.6 ms per loop
>>> %timeit list(combinations(range(200), 3))
10 loops, best of 3: 167 ms per loop
Unless you're working with very large sequences, I'm not sure this is worth the trouble. (Still, it was an interesting problem.)