Using If/Truth Statements with pandas

I tried referencing the pandas documentation but still can't figure out how to proceed.
I have this data
In [6]:
df
Out[6]:
    strike putCall
0       50       C
1       55       P
2       60       C
3       65       C
4       70       C
5       75       P
6       80       P
7       85       C
8       90       P
9       95       C
10     100       C
11     105       P
12     110       P
13     115       C
14     120       P
15     125       C
16     130       C
17     135       P
18     140       C
19     145       C
20     150       C
and am trying to run this code:
if df['putCall'] == 'P':
    if df['strike'] < 100:
        df['optVol'] = 1
    else:
        df['optVol'] = -999
else:
    if df['strike'] > df['avg_syn']:
        df['optVol'] = 1
    else:
        df['optVol'] = -999
I get an error message:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The above code and data are examples only, to illustrate the problem I ran into.
Any assistance would be appreciated.
John
OP add-on
The above question was answered very well by Joris, but I have a slight add-on question.
How would I call a function such as
def Bool2(df):
    df['optVol'] = df['strike'] / 100
    return df
rather than assign the value of optVol directly to 1 in the line:
df.loc[(df['putCall'] == 'P') & (df['strike'] < 100), 'optVol'] = 1
I would like to have the function Bool2 called and do the assigning. Obviously, the Bool2 function is much more complicated than I have portrayed.
I tried this (shot in the dark), but it did not work:
df.loc[(df['putCall'] == 'P') & (df['strike'] < 100), 'optVol'] = df.apply(Bool2, axis=1)
thanks again for the help

Typically, when you want to set values using such if/else logic, boolean indexing is the solution (see docs):
The logic in:
if df['strike'] < 100:
    df['optVol'] = 1
can be expressed with boolean indexing as:
df.loc[df['strike'] < 100, 'optVol'] = 1
For your example, where you have multiple nested if/else blocks, you can combine the conditions using &:
df.loc[(df['putCall'] == 'P') & (df['strike'] < 100), 'optVol'] = 1
The full equivalent of your code above could be like this:
df['optVol'] = -999
df.loc[(df['putCall'] == 'P') & (df['strike'] < 100), 'optVol'] = 1
df.loc[(df['putCall'] != 'P') & (df['strike'] > df['avg_syn']), 'optVol'] = 1
The reason you get the error message above is that a comparison like df['strike'] < 100 works elementwise: it gives you a Series of True and False values, while if expects a single True or False value.
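As a self-contained sketch of the whole approach (only the first few strikes are used, and avg_syn is a made-up stand-in since that column isn't shown in the posted frame):
import pandas as pd

df = pd.DataFrame({'strike': [50, 55, 60, 65, 70, 75],
                   'putCall': ['C', 'P', 'C', 'C', 'C', 'P']})
df['avg_syn'] = 60       # placeholder: avg_syn is not shown in the question's data
df['optVol'] = -999.0    # float default so fractional values fit later

df.loc[(df['putCall'] == 'P') & (df['strike'] < 100), 'optVol'] = 1
df.loc[(df['putCall'] != 'P') & (df['strike'] > df['avg_syn']), 'optVol'] = 1
print(df)
For the add-on question, the same boolean mask can pick out the rows that a vectorised computation is assigned to, so a Bool2-style function usually doesn't need apply at all; a hedged sketch:
mask = (df['putCall'] == 'P') & (df['strike'] < 100)
df.loc[mask, 'optVol'] = df.loc[mask, 'strike'] / 100  # Bool2's logic, applied to the masked rows only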

Related

How To Interpret Least Square Means and Standard Error

I am trying to understand the results I got for a fake dataset. I have two independent variables, hours and type, and a response, pain.
First question: How was 82.46721 calculated as the lsmeans for the first type?
Second question: Why is the standard error exactly the same (8.24003) for both types?
Third question: Why is the degrees of freedom 3 for both types?
data = data.frame(
  type = c("A", "A", "A", "B", "B", "B"),
  hours = c(60, 72, 61, 54, 68, 66),
  # pain = c(85, 95, 69, 73, 29, 30)
  pain = c(85, 95, 69, 85, 95, 69)
)
model = lm(pain ~ hours + type, data = data)
lsmeans(model, c("type", "hours"))
> data
type hours pain
1 A 60 85
2 A 72 95
3 A 61 69
4 B 54 85
5 B 68 95
6 B 66 69
> lsmeans(model, c("type", "hours"))
type hours lsmean SE df lower.CL upper.CL
A 63.5 82.46721 8.24003 3 56.24376 108.6907
B 63.5 83.53279 8.24003 3 57.30933 109.7562
Try this:
newdat <- data.frame(type = c("A", "B"), hours = c(63.5, 63.5))
predict(model, newdata = newdat)
An important thing to note here is that your model has hours as a continuous predictor, not a factor. lsmeans therefore evaluates the fitted equation at the mean of hours (63.5) for each type, which is exactly what the predict() call above reproduces. The df of 3 for both types is the model's residual degrees of freedom: 6 observations minus 3 estimated coefficients (intercept, hours slope, and the type contrast).

pandas: pandas._libs.hashtable.Int64HashTable.get_item

I have the following code operating on data frame df:
print df
categories = df['my_classification'].unique()
for c in categories:
print c
win = df[df.result == 'Won'][df['my_classification'] == c]['prob'][0]
print type(win)
lost = df[df.result == 'Lost'][df['my_classification'] == c]['prob'][0]
print type(lost)
Then I got the following output:
result my_classification prob
0 Won ENTERPRISE 0.657895
1 Won COMMERCIAL 0.342105
2 Lost ENTERPRISE 0.611842
3 Lost COMMERCIAL 0.388158
ENTERPRISE
<type 'numpy.float64'>
And the errors:
KeyErrorTraceback (most recent call last)
<ipython-input-4-38a901f9868a> in <module>()
38
39 print type(win)
---> 40 lost = df[df.result == 'Lost'][df['my_classification'] == c]['prob'][0]
41
42 print type(lost)
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_value(self, series, key)
2426 try:
2427 return self._engine.get_value(s, k,
-> 2428 tz=getattr(series.dtype, 'tz', None))
2429 except KeyError as e1:
2430 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4363)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4046)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13913)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13857)()
KeyError: 0
Here is what I don't understand: win and lost are computed in exactly the same way, so why was win OK while lost generated an error? Thanks!
Because you get the categories from the whole dataframe, but for won and lost you filter it down to a subset, and sometimes the category does not exist in that subset.
For example, with a df as follows:
result my_classification prob
0 Won ENTERPRISE 0.657895
1 Won COMMERCIAL 0.342105
2 Lost ENTERPRISE 0.611842
when you do
df[df.result == 'Lost'][df['my_classification'] == 'COMMERCIAL']['prob'][0]
it will return the error, because the filtered result is empty, and [0] looks up the index label 0 rather than the first position.
My solution, using groupby:
df.groupby(['result','my_classification']).head(1)
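One defensive rewrite of the failing lookup, as a sketch rather than the answer's code: filter first, then take the value by position with .iloc, guarding against combinations that are absent from the subset:
import pandas as pd

df = pd.DataFrame({'result': ['Won', 'Won', 'Lost', 'Lost'],
                   'my_classification': ['ENTERPRISE', 'COMMERCIAL', 'ENTERPRISE', 'COMMERCIAL'],
                   'prob': [0.657895, 0.342105, 0.611842, 0.388158]})

for c in df['my_classification'].unique():
    subset = df[(df.result == 'Lost') & (df['my_classification'] == c)]['prob']
    # .iloc[0] is positional, so it works even though the 'Lost' rows keep
    # their original index labels (2 and 3), which is why ['prob'][0] failed
    lost = subset.iloc[0] if not subset.empty else None
    print(c, lost)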

pandas - group by: create aggregation function using multiple columns

I have the following data frame:
id my_year my_month waiting_time target
001 2018 1 95 1
002 2018 1 3 3
003 2018 1 4 0
004 2018 1 40 1
005 2018 2 97 1
006 2018 2 3 3
007 2018 3 4 0
008 2018 3 40 1
I want to group by my_year and my_month, and then within each group compute my_rate based on
(# of records with waiting_time <= 90 and target = 1)/ total_records in the group
i.e. I am expecting output like:
my_year my_month my_rate
2018 1 0.25
2018 2 0.0
2018 3 0.5
I wrote the following code to compute the desired value my_rate:
def my_rate(data):
    waiting_time_list = data['waiting_time']
    target_list = data['target']
    total = len(data)
    my_count = 0
    for i in range(len(data)):
        if waiting_time_list[i] <= 90 and target_list[i] == 1:
            my_count += 1
    rate = float(my_count) / float(total)
    return rate

df.groupby(['my_year','my_month']).apply(my_rate)
However, I got the following error:
KeyError 0
KeyErrorTraceback (most recent call last)
<ipython-input-29-5c4399cefd05> in <module>()
17
---> 18 df.groupby(['my_year','my_month']).apply(my_rate)
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, func, *args, **kwargs)
714 # ignore SettingWithCopy here in case the user mutates
715 with option_context('mode.chained_assignment', None):
--> 716 return self._python_apply_general(f)
717
718 def _python_apply_general(self, f):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in _python_apply_general(self, f)
718 def _python_apply_general(self, f):
719 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 720 self.axis)
721
722 return self._wrap_applied_output(
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/groupby.pyc in apply(self, f, data, axis)
1727 # group might be modified
1728 group_axes = _get_axes(group)
-> 1729 res = f(group)
1730 if not _is_indexed_like(res, group_axes):
1731 mutated = True
<ipython-input-29-5c4399cefd05> in conversion_rate(data)
8 #print total_waiting_time_list[i], target_list[i]
9 #print i, total_waiting_time_list[i], target_list[i]
---> 10 if total_waiting_time_list[i] <= 90:# and target_list[i] == 1:
11 convert_90_count += 1
12 #print 'convert ', convert_90_count
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/series.pyc in __getitem__(self, key)
599 key = com._apply_if_callable(key, self)
600 try:
--> 601 result = self.index.get_value(self, key)
602
603 if not is_scalar(result):
/opt/conda/envs/python2/lib/python2.7/site-packages/pandas/core/indexes/base.pyc in get_value(self, series, key)
2426 try:
2427 return self._engine.get_value(s, k,
-> 2428 tz=getattr(series.dtype, 'tz', None))
2429 except KeyError as e1:
2430 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4363)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value (pandas/_libs/index.c:4046)()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13913)()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item (pandas/_libs/hashtable.c:13857)()
KeyError: 0
Any idea what I did wrong here? And how do I fix it? Thanks!
I believe it is better to use the mean of a boolean mask per group:
def my_rate(x):
    return ((x['waiting_time'] <= 90) & (x['target'] == 1)).mean()

df = df.groupby(['my_year','my_month']).apply(my_rate).reset_index(name='my_rate')
print(df)
my_year my_month my_rate
0 2018 1 0.25
1 2018 2 0.00
2 2018 3 0.50
Any idea what I did wrong here?
The problem is that waiting_time_list and target_list are not lists but Series:
waiting_time_list = data['waiting_time']
target_list = data['target']
print (type(waiting_time_list))
<class 'pandas.core.series.Series'>
print (type(target_list))
<class 'pandas.core.series.Series'>
So indexing them like lists fails, because the second group has indices 4 and 5, not 0 and 1:
if waiting_time_list[i] <= 90 and target_list[i] == 1:
To avoid this, convert the Series to lists:
waiting_time_list = data['waiting_time'].tolist()
target_list = data['target'].tolist()
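Putting the sample data and the accepted function together gives a runnable check of the expected output (a sketch; names match the answer above):
import pandas as pd

df = pd.DataFrame({'my_year': [2018] * 8,
                   'my_month': [1, 1, 1, 1, 2, 2, 3, 3],
                   'waiting_time': [95, 3, 4, 40, 97, 3, 4, 40],
                   'target': [1, 3, 0, 1, 1, 3, 0, 1]})

def my_rate(x):
    # mean of a boolean mask == fraction of rows meeting both conditions
    return ((x['waiting_time'] <= 90) & (x['target'] == 1)).mean()

out = df.groupby(['my_year', 'my_month']).apply(my_rate).reset_index(name='my_rate')
print(out)
#    my_year  my_month  my_rate
# 0     2018         1     0.25
# 1     2018         2     0.00
# 2     2018         3     0.50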

Removing part of a value in a certain column in a dataframe, and returning a DF

I have the following Data Frame named: mydf:
A B
0 3de (1ABS) Adiran
1 3SA (SDAS) Adel
2 7A (ASA) Ronni
3 820 (SAAa) Emili
I want to remove the " (xxxx)" and keep the values in column A, so the dataframe (mydf) will look like:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
I have tried :
print mydf['A'].apply(lambda x: re.sub(r" \(.+\)", "", x) )
but then I get a Series object back and not a dataframe object.
I have also tried to use replace:
df.replace([' \(.*\)'], [""], regex=True)
But it didn't change anything.
What am I doing wrong?
Thank you!
You can use the str.split() method:
In [3]: df.A = df.A.str.split('\s+\(').str[0]
In [4]: df
Out[4]:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
or the str.extract() method:
In [9]: df.A = df.A.str.extract(r'([^\(\s]*)', expand=False)
In [10]: df
Out[10]:
A B
0 3de Adiran
1 3SA Adel
2 7A Ronni
3 820 Emili
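For what it's worth, both of the question's attempts return new objects, so the result has to be assigned back; a minimal sketch using str.replace as a third option (the frame below just reproduces the question's data):
import pandas as pd

mydf = pd.DataFrame({'A': ['3de (1ABS)', '3SA (SDAS)', '7A (ASA)', '820 (SAAa)'],
                     'B': ['Adiran', 'Adel', 'Ronni', 'Emili']})

# str.replace returns a new Series, so the result must be assigned back --
# the question's df.replace(...) call "did nothing" because it was never assigned
mydf['A'] = mydf['A'].str.replace(r' \(.*\)', '', regex=True)
print(mydf)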

Find sum of the column values based on some other column

I have a input file like this:
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
j,z,b,bsy,afj,upz,343,13,ruhwd
u,i,a,dvp,ibt,dxv,154,00,adsif
t,a,a,jqj,dtd,yxq,540,49,kxthz
c,u,g,nfk,ekh,trc,085,83,xppnl
For every unique value of Column1, I need to find out the sum of column7
Similarly, for every unique value of Column2, I need to find out the sum of column7
Output for Column 1 should look like:
j,686
u,308
t,98
c,83
Output for Column 2 should look like:
z,686
i,308
a,98
u,83
I am fairly new to Python. How can I achieve the above?
This could be done using Python's Counter and csv library as follows:
from collections import Counter
import csv

c1 = Counter()
c2 = Counter()

with open('input.csv') as f_input:
    for cols in csv.reader(f_input):
        col7 = int(cols[6])
        c1[cols[0]] += col7
        c2[cols[1]] += col7

print "Column 1"
for value, count in c1.iteritems():
    print '{},{}'.format(value, count)

print "\nColumn 2"
for value, count in c2.iteritems():
    print '{},{}'.format(value, count)
Giving you the following output:
Column 1
c,85
j,686
u,308
t,1080
Column 2
i,308
a,1080
z,686
u,85
A Counter is a type of Python dictionary that is useful for counting items automatically. c1 holds all of the column 1 totals and c2 holds all of the column 2 totals. Note that Python numbers lists starting from 0, so the first entry in a list is [0].
The csv library loads each line of the file into a list, with each entry in the list representing a different column. The code takes column 7 (i.e. cols[6]) and converts it into an integer, as all columns are held as strings. It is then added to the counter using either the column 1 or 2 value as the key. The result is two dictionaries holding the totaled counts for each key.
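A tiny illustration of that auto-counting behaviour, independent of the file above:
from collections import Counter

c = Counter()
c['j'] += 343  # a missing key starts at 0, so no KeyError, unlike a plain dict
c['j'] += 343
print(c['j'])  # 686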
You can use pandas:
import pandas as pd

df = pd.read_csv('my_file.csv', header=None)
print(df.groupby(0)[6].sum())
print(df.groupby(1)[6].sum())
Output:
0
c 85
j 686
t 1080
u 308
Name: 6, dtype: int64
1
a 1080
i 308
u 85
z 686
Name: 6, dtype: int64
The data frame should look like this:
print(df.head())
Output:
0 1 2 3 4 5 6 7 8
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
You can also use your own names for the columns, like c1, c2, ..., c9:
df = pd.read_csv('my_file.csv', index_col=False, names=['c' + str(x) for x in range(1, 10)])
print(df)
Output:
c1 c2 c3 c4 c5 c6 c7 c8 c9
0 j z b bsy afj upz 343 13 ruhwd
1 u i a dvp ibt dxv 154 0 adsif
2 t a a jqj dtd yxq 540 49 kxthz
3 j z b bsy afj upz 343 13 ruhwd
4 u i a dvp ibt dxv 154 0 adsif
5 t a a jqj dtd yxq 540 49 kxthz
6 c u g nfk ekh trc 85 83 xppnl
Now, group by column c1 or column c2 and sum up column c7:
print(df.groupby(['c1'])['c7'].sum())
print(df.groupby(['c2'])['c7'].sum())
Output:
c1
c 85
j 686
t 1080
u 308
Name: c7, dtype: int64
c2
a 1080
i 308
u 85
z 686
Name: c7, dtype: int64
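If the key,sum text format from the question is wanted instead of the printed Series, one possible follow-up (an assumption about the desired output, not part of the answer above) is:
for key, total in df.groupby('c1')['c7'].sum().items():
    print('{},{}'.format(key, total))
# c,85
# j,686
# t,1080
# u,308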
SO isn't supposed to be a code-writing service, but I had a few minutes. :) Without pandas you can do it with the csv module:
import csv

def sum_to(results, key, add_value):
    if key not in results:
        results[key] = 0
    results[key] += int(add_value)

column1_results = {}
column2_results = {}

with open("input.csv", 'rt') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        sum_to(column1_results, row[0], row[6])
        sum_to(column2_results, row[1], row[6])

print column1_results
print column2_results
Results:
{'c': 85, 'j': 686, 'u': 308, 't': 1080}
{'i': 308, 'a': 1080, 'z': 686, 'u': 85}
Your expected results don't seem to match the math that Mike's answer and mine got using your spec. I'd double-check that.