pandas: pandas._libs.hashtable.Int64HashTable.get_item - python-2.7

I have the following code operating on data frame df:
print df
categories = df['my_classification'].unique()
for c in categories:
    print c
    win = df[df.result == 'Won'][df['my_classification'] == c]['prob'][0]
    print type(win)
    lost = df[df.result == 'Lost'][df['my_classification'] == c]['prob'][0]
    print type(lost)
Then I got the following output:
result my_classification prob
0 Won ENTERPRISE 0.657895
1 Won COMMERCIAL 0.342105
2 Lost ENTERPRISE 0.611842
3 Lost COMMERCIAL 0.388158
ENTERPRISE
<type 'numpy.float64'>
And the errors:
There was a problem running this cell
KeyError 0
KeyErrorTraceback (most recent call last)
<ipython-input-4-38a901f9868a> in <module>()
38
39 print type(win)
---> 40 lost = df[df.result == 'Lost'][df['my_classification'] == c]['prob'][0]
41
42 print type(lost)
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_value(self, series, key)
2426 try:
2427 return self._engine.get_value(s, k,
-> 2428 tz=getattr(series.dtype, 'tz', None))
2429 except KeyError as e1:
2430 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4363)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4046)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13913)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13857)()
KeyError: 0
Here is what I don't understand: win and lost are built in exactly the same way, so why did win work while lost raised an error? Thanks!

That's because you get the categories from the whole dataframe, but for won and lost you filter on a subset, where a category sometimes does not exist.
For example, with df as follows:
result my_classification prob
0 Won ENTERPRISE 0.657895
1 Won COMMERCIAL 0.342105
2 Lost ENTERPRISE 0.611842
when you do
df[df.result == 'Lost'][df['my_classification'] == 'COMMERCIAL']['prob'][0]
it will raise the error.
My solution, using groupby:
df.groupby(['result','my_classification']).head(1)
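Note that with the data shown above, both Lost rows do exist; the deeper issue is that the trailing [0] is a label lookup, not a positional one. After filtering on result == 'Lost', the surviving index labels are 2 and 3, so label 0 no longer exists and pandas raises KeyError: 0; win only worked because the Won/ENTERPRISE row happened to sit at label 0. A minimal sketch of a safer lookup, combining both conditions with & and guarding against missing combinations (the names here are illustrative, not from the original post):

mask = (df['result'] == 'Lost') & (df['my_classification'] == c)
lost_rows = df.loc[mask, 'prob']
if not lost_rows.empty:
    lost = lost_rows.iloc[0]  # positional: first matching row
else:
    lost = None               # this result/classification pair is absent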

Related

Pandas Merge Error iterable, not itertools.imap

I'm trying to merge two dataframes using the pandas merge code below. Each dataframe has just three columns. I've done similar merges before without issue. I've provided .info() on each dataframe. I'm getting an error about an iterable vs. itertools.imap, and I have no clue what it's talking about. Any tips are very much appreciated.
Data:
pio_smp2_sm.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 12779 entries, 15 to 68311
Data columns (total 3 columns):
entityId 12779 non-null object
targetEntityId 12779 non-null object
eventTime 12779 non-null object
dtypes: object(3)
memory usage: 399.3+ KB
cm_smp2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28035 entries, 40 to 698858
Data columns (total 3 columns):
user_id 28035 non-null object
product_id 28035 non-null object
time_stamp 28035 non-null object
dtypes: object(3)
memory usage: 876.1+ KB
Code:
comp_df2=pd.merge(pio_smp2_sm,cm_smp2,how='inner',left_on=['entityId','targetEntityId'],right_on=['user_id','product_id'])
Error:
TypeErrorTraceback (most recent call last)
<ipython-input-235-6882a22fe6a1> in <module>()
23
24
---> 25 comp_df2=pd.merge(pio_smp2_sm,cm_smp2,how='inner',left_on=['entityId','targetEntityId'],right_on=['user_id','product_id'])
26
27 # print(comp_df2.shape[0])
/data2/user/anaconda2/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
56 copy=copy, indicator=indicator,
57 validate=validate)
---> 58 return op.get_result()
59
60
/data2/user/anaconda2/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in get_result(self)
580 self.left, self.right)
581
--> 582 join_index, left_indexer, right_indexer = self._get_join_info()
583
584 ldata, rdata = self.left._data, self.right._data
/data2/user/anaconda2/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in _get_join_info(self)
746 else:
747 (left_indexer,
--> 748 right_indexer) = self._get_join_indexers()
749
750 if self.right_index:
/data2/user/anaconda2/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in _get_join_indexers(self)
725 self.right_join_keys,
726 sort=self.sort,
--> 727 how=self.how)
728
729 def _get_join_info(self):
/data2/user/anaconda2/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
1048
1049 # get left & right join labels and num. of levels at each location
-> 1050 llab, rlab, shape = map(list, zip(* map(fkeys, left_keys, right_keys)))
1051
1052 # get flat i8 keys from label lists
TypeError: type object argument after * must be an iterable, not itertools.imap
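A note on reading this traceback (an educated guess, not from the original post): on Python 2, pandas aliases map and zip to itertools.imap and itertools.izip in its compat layer, and CPython 2 replaces any TypeError raised while a lazy imap is being consumed with this generic "argument after * must be an iterable" message. So the real failure is probably inside the key-factorization step (for example an unexpected key type), not in the merge call itself. A minimal sketch of the same two-key inner merge on toy frames (hypothetical data) can confirm the call pattern is fine in a clean environment:

import pandas as pd

# Toy frames mirroring the real columns (hypothetical data)
pio_smp2_sm = pd.DataFrame({'entityId': ['u1', 'u2'],
                            'targetEntityId': ['p1', 'p2'],
                            'eventTime': ['t1', 't2']})
cm_smp2 = pd.DataFrame({'user_id': ['u1', 'u3'],
                        'product_id': ['p1', 'p3'],
                        'time_stamp': ['t1', 't3']})

comp_df2 = pd.merge(pio_smp2_sm, cm_smp2, how='inner',
                    left_on=['entityId', 'targetEntityId'],
                    right_on=['user_id', 'product_id'])
print comp_df2  # expect the single matching row (u1, p1)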

pandas - group by: create aggregation function using multiple columns

I have the following data frame:
id my_year my_month waiting_time target
001 2018 1 95 1
002 2018 1 3 3
003 2018 1 4 0
004 2018 1 40 1
005 2018 2 97 1
006 2018 2 3 3
007 2018 3 4 0
008 2018 3 40 1
I want to group by my_year and my_month, and then within each group compute my_rate as:
(# of records with waiting_time <= 90 and target = 1) / (total # of records in the group)
i.e. I am expecting output like:
my_year my_month my_rate
2018 1 0.25
2018 2 0.0
2018 3 0.5
I wrote the following code to compute the desired value my_rate:
def my_rate(data):
    waiting_time_list = data['waiting_time']
    target_list = data['target']
    total = len(data)
    my_count = 0
    for i in range(len(data)):
        if waiting_time_list[i] <= 90 and target_list[i] == 1:
            my_count += 1
    rate = float(my_count)/float(total)
    return rate

df.groupby(['my_year','my_month']).apply(my_rate)
However, I got the following error:
KeyError 0
KeyErrorTraceback (most recent call last)
<ipython-input-29-5c4399cefd05> in <module>()
17
---> 18 df.groupby(['my_year','my_month']).apply(my_rate)
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, func, *args, **kwargs)
714 # ignore SettingWithCopy here in case the user mutates
715 with option_context('mode.chained_assignment', None):
--> 716 return self._python_apply_general(f)
717
718 def _python_apply_general(self, f):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in _python_apply_general(self, f)
718 def _python_apply_general(self, f):
719 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 720 self.axis)
721
722 return self._wrap_applied_output(
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, f, data, axis)
1727 # group might be modified
1728 group_axes = _get_axes(group)
-> 1729 res = f(group)
1730 if not _is_indexed_like(res, group_axes):
1731 mutated = True
<ipython-input-29-5c4399cefd05> in conversion_rate(data)
8 #print total_waiting_time_list[i], target_list[i]
9 #print i, total_waiting_time_list[i], target_list[i]
---> 10 if total_waiting_time_list[i] <= 90:# and target_list[i] == 1:
11 convert_90_count += 1
12 #print 'convert ', convert_90_count
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_value(self, series, key)
2426 try:
2427 return self._engine.get_value(s, k,
-> 2428 tz=getattr(series.dtype, 'tz', None))
2429 except KeyError as e1:
2430 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4363)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4046)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13913)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13857)()
KeyError: 0
Any idea what I did wrong here? And how do I fix it? Thanks!
I believe it is better to use the mean of a boolean mask per group:
def my_rate(x):
    return ((x['waiting_time'] <= 90) & (x['target'] == 1)).mean()

df = df.groupby(['my_year','my_month']).apply(my_rate).reset_index(name='my_rate')
print (df)
my_year my_month my_rate
0 2018 1 0.25
1 2018 2 0.00
2 2018 3 0.50
Any idea what I did wrong here?
The problem is that waiting_time_list and target_list are not lists but Series:
waiting_time_list = data['waiting_time']
target_list = data['target']
print (type(waiting_time_list))
<class 'pandas.core.series.Series'>
print (type(target_list))
<class 'pandas.core.series.Series'>
So indexing with [i] fails, because in the second group the index labels are 4 and 5, not 0 and 1:
if waiting_time_list[i] <= 90 and target_list[i] == 1:
To avoid this, convert the Series to lists:
waiting_time_list = data['waiting_time'].tolist()
target_list = data['target'].tolist()
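For completeness, a sketch of the original loop-based function with a positional fix applied (using .iloc instead of .tolist(); either works):

def my_rate_loop(data):
    waiting_time = data['waiting_time']
    target = data['target']
    my_count = 0
    for i in range(len(data)):
        # .iloc is positional, so it is safe whatever the index labels are
        if waiting_time.iloc[i] <= 90 and target.iloc[i] == 1:
            my_count += 1
    return float(my_count) / float(len(data))

df.groupby(['my_year','my_month']).apply(my_rate_loop)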

Removing part of a value in a certain column in a dataframe, and returning a DataFrame

I have the following Data Frame named: mydf:
A B
0 3de (1ABS) Adiran
1 3SA (SDAS) Adel
2 7A (ASA) Ronni
3 820 (SAAa) Emili
I want to remove the " (xxxx)" part and keep the values in column A, so the dataframe (mydf) will look like:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
I have tried:
print mydf['A'].apply(lambda x: re.sub(r" \(.+\)", "", x) )
but then I get a Series object back and not a dataframe object.
I have also tried to use replace:
df.replace([' \(.*\)'], [""], regex=True), but it didn't change anything.
What am I doing wrong?
Thank you!
You can use the str.split() method:
In [3]: df.A = df.A.str.split('\s+\(').str[0]
In [4]: df
Out[4]:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
or the str.extract() method:
In [9]: df.A = df.A.str.extract(r'([^\(\s]*)', expand=False)
In [10]: df
Out[10]:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
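As for what went wrong in the original attempts (a note, not from the original answer): both apply and replace return new objects rather than modifying mydf in place, so the result has to be assigned back:

import re

# apply returns a new Series; assign it back to the column
mydf['A'] = mydf['A'].apply(lambda x: re.sub(r" \(.+\)", "", x))

# replace likewise returns a new DataFrame unless assigned (or inplace=True)
mydf = mydf.replace([r' \(.*\)'], [''], regex=True)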

Why do I keep getting the same answer with this conditional? (python, pandas)

This might be a really dumb problem, but I've been stuck on it for a while.
Here's the CSV:
DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOLUME
02/03/1997,09:30:00,3045.00,3045.00,3045.00,3045.00,28
02/04/1997,09:30:00,3077.00,3078.00,3077.00,3077.50,280
02/05/1997,09:30:00,3094.00,3094.50,3094.00,3094.00,50
02/06/1997,09:30:00,3106.00,3107.50,3106.00,3107.50,53
02/07/1997,09:30:00,3144.00,3144.00,3143.50,3143.50,15
02/06/1997,16:20:00,3126.50,3126.50,3126.00,3126.00,24
02/06/1997,16:21:00,3126.50,3128.00,3126.50,3128.00,169
02/06/1997,16:22:00,3128.00,3128.00,3126.00,3126.00,243
02/06/1997,16:23:00,3125.50,3126.50,3125.50,3125.50,26
This is just an example I made from the original, because the original is really long. I moved all the "09:30:00" rows to the top to make it easier.
Here's my code:
df = pd.read_csv('example.txt', parse_dates = [["DATE", "TIME"]], index_col=0)
b930 = df.HIGH.at_time("09:30:00")
a = 0
if 'b930 < 3044.00':
    a = 7
else:
    a = 10
print a
If I run it this way I get 7, which I probably shouldn't.
a = 0
if 'b930 > 3044.00':
    a = 7
else:
    a = 10
print a
And if I run it this way I get 7, which is good.
I've honestly tried a bunch of other things, but I erased them.
You are working with a Series, so you have to use all() or any():
b930 = df.HIGH.at_time("09:30:00")
print b930
DATE_TIME
1997-02-03 09:30:00 3045.0
1997-02-04 09:30:00 3078.0
1997-02-05 09:30:00 3094.5
1997-02-06 09:30:00 3107.5
1997-02-07 09:30:00 3144.0
#ValueError: The truth value of a Series is ambiguous.
# Use a.empty, a.bool(), a.item(), a.any() or a.all().
if b930 < 3044.00:
    a = 7
else:
    a = 10
print a
Check if all values are True:
print b930 < 3046.00
DATE_TIME
1997-02-03 09:30:00 True
1997-02-04 09:30:00 False
1997-02-05 09:30:00 False
1997-02-06 09:30:00 False
1997-02-07 09:30:00 False
Name: HIGH, dtype: bool
a = 0
if (b930 < 3046.00).all():
    a = 7
else:
    a = 10
print a
10
Check if any value is True:
if (b930 < 3046.00).any():
    a = 7
else:
    a = 10
print a
7
Another example:
print b930 > 3044.00
DATE_TIME
1997-02-03 09:30:00 True
1997-02-04 09:30:00 True
1997-02-05 09:30:00 True
1997-02-06 09:30:00 True
1997-02-07 09:30:00 True
Name: HIGH, dtype: bool
a = 0
if (b930 > 3044.00).all():
    a = 7
else:
    a = 10
print a
7
if (b930 > 3044.00).any():
    a = 7
else:
    a = 10
print a
7
This is a non-empty string and will always be cast to True:
'b930 < 3044.00'
change it to:
b930 < 3044.00
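If the intent was to compare one particular value rather than the whole Series (say the first 09:30 HIGH), select a scalar first; this sketch assumes the first row is the one of interest:

first_high = b930.iloc[0]  # a plain numpy.float64 scalar
if first_high < 3044.00:
    a = 7
else:
    a = 10
print a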

Using If/Truth Statements with pandas

I tried referencing the pandas documentation but still can't figure out how to proceed.
I have this data
In [6]:
df
Out[6]:
strike putCall
0 50 C
1 55 P
2 60 C
3 65 C
4 70 C
5 75 P
6 80 P
7 85 C
8 90 P
9 95 C
10 100 C
11 105 P
12 110 P
13 115 C
14 120 P
15 125 C
16 130 C
17 135 P
18 140 C
19 145 C
20 150 C
and am trying to run this code:
if df['putCall'] == 'P':
    if df['strike'] < 100:
        df['optVol'] = 1
    else:
        df['optVol'] = -999
else:
    if df['strike'] > df['avg_syn']:
        df['optVol'] = 1
    else:
        df['optVol'] = -999
I get an error message:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The above code and data are examples only, to illustrate the problem I ran into.
Any assistance would be appreciated.
John
OP add-on
The above question was answered very well by Joris, but I have a slight add-on question.
How would I call a function such as:
def Bool2(df):
    df['optVol'] = df['strike']/100
    return df
rather than assign the value of optVol directly to 1 in the line:
df.loc[(df['putCall'] == 'P') & (df['strike']>=100), 'optVol'] = 1
I would like to have the function Bool2 called to do the assigning. Obviously, the real Bool2 function is much more complicated than I have portrayed.
I tried this (shot in the dark), but it did not work:
df.loc[(df['putCall'] == 'P') & (df['strike'] < 100), 'optVol'] = df.apply(Bool2, axis=1)
Thanks again for the help.
Typically, when you want to set values using such if-else logic, boolean indexing is the solution (see the docs):
The logic in:
if df['strike'] < 100:
    df['optVol'] = 1
can be expressed with boolean indexing as:
df.loc[df['strike'] < 100, 'optVol'] = 1
For your example, you have multiple nested if-else blocks, and you can combine the conditions using &:
df.loc[(df['putCall'] == 'P') & (df['strike']>=100), 'optVol'] = 1
The full equivalent of your code above could be like this:
df['optVol'] = -999
df.loc[(df['putCall'] == 'P') & (df['strike']>=100), 'optVol'] = 1
df.loc[(df['putCall'] != 'P') & (df['strike']>df['avg_syn']), 'optVol'] = 1
The reason you get the error message above is that df['strike'] < 100 is an elementwise comparison: it gives you a Series of True and False values, while if expects a single True or False value.
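On the add-on question (a sketch, not from the original answer): Bool2 as written returns the whole row, so apply(Bool2, axis=1) produces a DataFrame rather than the single column of values the assignment needs, and it is also computed over every row, not just the masked ones. One way is to compute on the selected rows only and assign back with .loc; bool2_row below is a hypothetical stand-in for the real logic:

mask = (df['putCall'] == 'P') & (df['strike'] < 100)

# Option 1: vectorized, computed only on the selected rows
df.loc[mask, 'optVol'] = df.loc[mask, 'strike'] / 100.0

# Option 2: a row-wise function applied to just those rows
def bool2_row(row):
    return row['strike'] / 100.0  # placeholder for the real computation

df.loc[mask, 'optVol'] = df[mask].apply(bool2_row, axis=1)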