Python 2.7 Pandas: How to replace a for-loop? - python-2.7

I have a large pandas dataframe with 2000 rows (one date per row) and 2000 columns (1 second intervals). Each cell represents a temperature reading.
Starting with the 5th row, I need to go back 5 rows and find all the observations where the 1st column in the row is higher than the 2nd column in the row.
For the 5th row I may find 2 such observations. I then want to do summary stats on the observations and append those summary stats to a list.
Then I go to the 6th row, go back 5 rows, and find all the observations where the 1st column is higher than the 2nd column. I get all the observations, do summary stats on them, and append the results to the new dataframe.
So for each row in the dataframe, I want to go back 5 days, get the observations, get the stats, and append the stats to a dataframe.
The problem is that if I perform this operation on rows 5-2000, then I will have a for-loop that is 1995 cycles long, and this takes a while.
What is the better or best way to do this?
Here is the code:
print huge_dataframe
sec_1 sec_2 sec_3 sec_4 sec_5
2013_12_27 0.05 0.12 0.06 0.15 0.14
2013_12_28 0.06 0.32 0.56 0.14 0.17
2013_12_29 0.07 0.52 0.36 0.13 0.13
2013_12_30 0.02 0.12 0.16 0.55 0.12
2013_12_31 0.06 0.30 0.06 0.14 0.01
2014_01_01 0.05 0.12 0.06 0.15 0.14
2014_01_02 0.06 0.32 0.56 0.14 0.17
2014_01_03 0.07 0.52 0.36 0.13 0.13
2014_01_04 0.02 0.12 0.16 0.55 0.12
2014_01_05 0.06 0.30 0.06 0.14 0.01
for each row in huge_dataframe.ix[5:]:
    move = row[sec_1] - row[sec_2]
    if move < 0: move = 'DOWN'
    elif move > 0: move = 'UP'
    relevant_dataframe = huge_dataframe.ix[only the 5 rows preceding the current row]
    if move == 'UP':
        mask = relevant_dataframe[sec_1 < sec_2] # creates a boolean dataframe
        observations_df = relevant_dataframe[mask]
    elif move == 'DOWN':
        mask = relevant_dataframe[sec_1 > sec_2] # creates a boolean dataframe
        observations_df = relevant_dataframe[mask]
    # At this point I have observations_df which is only filled
    # with rows where sec_1 < sec_2 or the opposite, depending on which
    # row I am in.
    summary_stats = str(observations_df.describe())
    summary_list.append(summary_stats) # This is the goal.
    # I want to ultimately turn the list into a dataframe.

Since there is no code to create the data, I will just sketch the code that I would try to make work. In general, avoid row-wise operations whenever you can. I had no clue at first either, but I got interested, and some research turned up TimeGrouper:
import pandas as pd
df = huge_dataframe  # the index must be a DatetimeIndex for time-based grouping
df['move'] = df['sec_1'] > df['sec_2']
def foobarRules(group):
    # keep in mind that in here, you refer not to "relevant_dataframe", but to "group"
    if group['move'].iloc[-1]:
        pass  # some logic
    else:
        pass  # some other logic
    return str(group.describe())
grouper = pd.TimeGrouper('5D')  # in newer pandas: pd.Grouper(freq='5D')
allMyStatistics = df.groupby(grouper).apply(foobarRules)
I honestly have no clue how the return value behaves if each group returns a multi-dimensional dataframe. I know it works well if you return either a row or a column, but if you return a dataframe that contains both rows and columns for every group, I guess pandas is smart enough to combine them into a panel. Well, you will find out.
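One caveat with the sketch above: a '5D' grouper cuts the data into consecutive, non-overlapping 5-day blocks, whereas the question asks for the 5 rows preceding every row. Below is a minimal sketch of that trailing-window variant using rolling windows. It assumes only one direction of the comparison (sec_1 > sec_2) and a few example statistics instead of the full describe() output; the function name trailing_stats and the toy data are made up for illustration.

```python
import pandas as pd

def trailing_stats(df, window=5):
    """Stats of sec_1 over the `window` rows before each row,
    restricted to rows where sec_1 > sec_2 (illustrative choice)."""
    hit = df["sec_1"] > df["sec_2"]
    # shift(1) excludes the current row, so each window covers only earlier rows.
    count = hit.astype(float).shift(1).rolling(window).sum()
    # Non-matching rows become NaN, which rolling mean/max skip.
    masked = df["sec_1"].where(hit).shift(1)
    roll = masked.rolling(window, min_periods=1)
    out = pd.DataFrame({"count": count, "mean": roll.mean(), "max": roll.max()})
    # The first `window` rows lack a full history, so drop them.
    return out.iloc[window:]

# Hypothetical toy data, not the question's readings.
toy = pd.DataFrame({"sec_1": [1, 5, 2, 7, 3, 9, 4],
                    "sec_2": [2, 3, 4, 5, 6, 7, 8]}, dtype=float)
stats = trailing_stats(toy)
print(stats)
```

The opposite direction (sec_1 < sec_2) can be computed the same way, and the two results can then be selected per row depending on the current row's move.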

Related

SAS: Putting observations in bin and keep the ones closest to it

I have a list of observations with a few variables. I need to put them in a bin (below) and only keep one observation in each bin which is closest to the bin's number:
Bins
0.94
0.96
0.98
1.00
1.02
1.04
1.06
Data
Variable  Price        Value_to_bin  Closest bin
a         0.630527682  0.935         0.94
b         0.441296291  0.979         0.98
c         0.350173415  0.969
d         0.920932417  0.993
e         0.361863025  0.959         0.96
f         0.027205755  1.003         1.00
g         0.878286791  1.045
h         0.206434946  0.971
i         0.259272294  1.021         1.02
j         0.081774863  0.982
k         0.01146324   0.992
l         0.283027273  1.037         1.04
m         0.188747537  0.993
n         0.554786     1.064         1.06
o         0.784774     1.065
And then just keep the ones that are closest to the bin value (i.e. delete the ones that have blanks in the 'closest_bin' variable).
I tried to use proc rank but I can't get rid of the rest or match with the bin (something like 'closest' doesn't exist as far as I know).
SAS SQL with automatic remerging can perform the query quite succinctly. The consistent binning to a 0.02 level allows the ROUND function to compute the bin values to the nearest 0.02 unit.
proc sql;
  create table want as
  select
    var,
    price,
    value,
    round(value,0.02) as valbin_02
  from have
  group by valbin_02
  having abs(valbin_02-value) = min(abs(valbin_02-value))
  ;
quit;
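For readers who want to check the logic outside SAS, here is a rough Python/pandas sketch of the same idea: round each value to the nearest 0.02 and keep the row with the smallest distance in each bin. The function name closest_per_bin and the column names var and value are assumptions that mirror the query above. Unlike the HAVING clause, which keeps every tied row, this version keeps only one row per bin.

```python
import pandas as pd

def closest_per_bin(df, value_col="value", step=0.02):
    out = df.copy()
    # Nearest multiple of `step`; round(2) tidies float noise and assumes
    # bins have at most two decimals, as in the question.
    out["bin"] = ((out[value_col] / step).round() * step).round(2)
    out["dist"] = (out[value_col] - out["bin"]).abs()
    # idxmin keeps a single row per bin (the first one on ties).
    keep = out.groupby("bin")["dist"].idxmin()
    return out.loc[keep].drop(columns="dist").reset_index(drop=True)

# Values copied from the question's Value_to_bin column.
have = pd.DataFrame({
    "var": list("abcdefghijklmno"),
    "value": [0.935, 0.979, 0.969, 0.993, 0.959, 1.003, 1.045, 0.971,
              1.021, 0.982, 0.992, 1.037, 0.993, 1.064, 1.065],
})
want = closest_per_bin(have)
print(want)
```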

Show Print Format in Jupyter Widgets

I have a result from sklearn.metrics' classification_report, and when I print the report it looks like this:
             precision    recall  f1-score   support
          1       1.00      0.84      0.91        43
          2       0.12      1.00      0.22         1
avg / total       0.98      0.84      0.90        44
Now, the question is how can I show the result in a Jupyter widget (in the above format) and update its value?
Currently, I am using html widgets to show the result:
#pass test and result vectors
report = classification_report(pred_test , self.y_test_data)
predict_table = widgets.HTML(value = "")
predict_table.value = report
but it looks like the following:
precision recall f1-score support 1 1.00 0.81 0.90 43 2 0.00 0.00 0.00 0 avg / total 1.00 0.81 0.90 43
I found a simple solution using HTML! Since we are using an HTML widget in Jupyter, the browser collapses the spaces and line breaks in the report. Wrapping the report in a pre tag preserves them:
predict_table.value = "<pre>" + report + "</pre>"
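A small sketch of the same idea as a helper, in case the text might contain characters such as < or &. The helper name as_preformatted is made up for illustration; html.escape is from the Python 3 standard library, so this part does not apply as-is to Python 2.

```python
import html

def as_preformatted(text):
    """Wrap plain text in <pre> so an HTML widget keeps its spacing.
    Escaping guards against '<', '>' and '&' being read as markup."""
    return "<pre>" + html.escape(text) + "</pre>"

sample = "precision  recall\n     1.00    0.84"
print(as_preformatted(sample))
# Assumed usage with the widget from the question:
# predict_table.value = as_preformatted(report)
```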

Netlogo: runtime-error with list, item, and -1

I have a rather specific error in netlogo that I've been staring at for a while now. Hope you guys have some insight.
The error is in code that looks back in a list called 'strategy'. If the list is longer than investment-time, the variables 'REfocus' and 'PRICE' are set to a certain value. If the list is not longer than investment-time, the variables are not set (and thus remain 0).
The code consists of a procedure strategy_actions and a reporter investment_time. Investment-time is approximately 3 years, but as ticks are in months, investment-time is rescaled to months. In strategy_actions, investment-time is scaled back to years, as each entry in the strategy list is also annual. (The scaling and rescaling seems arbitrary, but as investment-time is used a lot by other parts of the code, it made more sense to do it like this.) The goal is to take the strategy from x time back (equal to investment-time).
The code (error follows underneath):
to strategy_actions
  set_ROI
  start_supply?
  if current_strategy != 0
  [
    let it (investment_time / 12)
    ifelse it >= length strategy
    [
      set REfocus 0
    ]
    [
      if item (it - 1) strategy = 1
      [
        if supply? = true [set_PRICE (set_discrete_distribution 0.29 0.19 0.29 0.15 0.07 0 0) (set_discrete_distribution 0.14 0.12 0.25 0.25 0.25 0 0)]
        ifelse any? ids [set REfocus mean [mot_RE] of ids][set REfocus set_discrete_distribution 0.07 0.03 0.07 0.17 0.66 0 0]
      ]
      if item (it - 1) strategy = 2
      [
        if supply? = true [set_PRICE (set_discrete_distribution 0.27 0.21 0.32 0.11 0.09 0 0) (set_discrete_distribution 0.15 0.11 0.22 0.30 0.23 0 0)]
        ifelse any? prods [set REfocus mean [mot_RE] of prods][set REfocus set_discrete_distribution 0.12 0.03 0.10 0.18 0.57 0 0]
      ]
      if item (it - 1) strategy = 3
      [
        if supply? = true [set_PRICE (set_discrete_distribution 0.26 0.22 0.26 0.18 0.09 0 0) (set_discrete_distribution 0.07 0.08 0.19 0.30 0.35 0 0)]
        ifelse any? cons[set REfocus mean [mot_RE] of cons][set REfocus set_discrete_distribution 0.08 0.06 0.15 0.27 0.45 0 0]
      ]
    ]
    set RE_history fput REfocus RE_history
  ]
end

to-report investment_time
  report ((random-normal 3 1) * 12) ;approximately 3 years investment time
end
Somehow, I sometimes get this runtime error during my BehaviorSpace experiment:
-1 isn't greater than or equal to zero.
error while observer running ITEM
called by procedure STRATEGY_ACTIONS
called by procedure SET_MEETING_ACTIONS
called by procedure GO
Does anyone know what causes this error?
You would help me out a lot!
Cheers,
Maria
It appears that investment_time is occasionally coming in as zero or below (random-normal 3 1 is unbounded, so a draw of 0 or less is rare but possible), so you are asking for item (0 - 1) of the strategy list. I did a bit of playing around with item and learned (to my surprise) that item (0.0001 - 1) strategy works just fine, yielding the 0th item in the list in spite of the argument being negative. But item (0 - 1) strategy does give the error you cite. Apparently an item number greater than -1 is interpreted as zero. Indeed, item seems to truncate any fractional argument rather than rounding it. E.g., item 0.9 is interpreted as item 0, as is item -0.9.
That might be worth putting in the documentation.
HTH,
Charles
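To make the arithmetic concrete, here is a small Python sketch (not NetLogo) of the index computation. It assumes that item truncates toward zero, as observed above, and models that with int(). The function names are made up for illustration, and the clamp is just one possible guard.

```python
import math

def lookback_index(investment_time_months):
    """Index used by `item (it - 1) strategy`, with `it` in years.
    int() truncates toward zero, modelling the behaviour observed
    for `item`; that truncation is an assumption here."""
    return int(investment_time_months / 12.0 - 1)

def safe_lookback_index(investment_time_months):
    """Same index, clamped so it never drops below 0."""
    return max(0, lookback_index(investment_time_months))

def prob_nonpositive(mean=3.0, sd=1.0):
    """P(X <= 0) for X ~ Normal(mean, sd), via the normal CDF."""
    return 0.5 * (1.0 + math.erf((0.0 - mean) / (sd * math.sqrt(2.0))))

print(lookback_index(36), lookback_index(0), safe_lookback_index(-6))
print(prob_nonpositive())
```

With a mean of 3 and a standard deviation of 1, a draw of 0 or less is a three-sigma event, which would explain why the error shows up only occasionally over many BehaviorSpace runs.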

unclear pandas merge error

I have two dataframes like the ones below. I’m trying to merge them on the common field user_id. I’ve checked the syntax and I can not see what the issue is. I’m running python 2.7. Does anyone see the issue?
Code:
print s_data.columns
Index([u'user_id', u'bdn', u'preference_bdn'], dtype='object')
print data.columns
Index([u'user_id', u'bdn', u'preference_bdn'], dtype='object')
pd.merge[s_data, data, how='inner',left_on='user_id', right_on='user_id'].head()
Error:
File "<ipython-input-55-820f93556a69>", line 3
pd.merge[s_data, data how='inner',left_on='user_id', right_on='user_id'].head()
^
SyntaxError: invalid syntax
Data:
print s_data.head()
user_id bdn preference_bdn
0 4104910 vfs 0.95
1 4282779 vfs 1.00
2 5125665 MAIDE 0.65
3 5125665 SP 0.43
4 5125665 DK 0.11
print data.head()
user_id bdn preference_bdn
0 3949334 M 0.37
1 3949334 RAC. 0.37
2 3949334 B 0.19
3 3949334 TAY 0.19
4 4105144 AL 0.68
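There should be a comma between data and how. Also, merge is a function, so it has to be called with parentheses rather than square brackets; keyword arguments inside [...] are a syntax error in themselves. Use this -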
s_data.merge(data, how = 'inner', on ='user_id')
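Because both frames also share the columns bdn and preference_bdn, the merged result will contain two copies of each. A minimal sketch of the corrected call with explicit suffixes follows. The rows are partly made up (user 5125665 is placed in both frames so that the inner join returns something), and the suffix names are arbitrary.

```python
import pandas as pd

# Hypothetical rows modelled on the question's data.
s_data = pd.DataFrame({"user_id": [4104910, 5125665, 5125665],
                       "bdn": ["vfs", "MAIDE", "SP"],
                       "preference_bdn": [0.95, 0.65, 0.43]})
data = pd.DataFrame({"user_id": [5125665, 3949334],
                     "bdn": ["M", "B"],
                     "preference_bdn": [0.37, 0.19]})

# Parentheses, commas between all arguments, and suffixes for the
# overlapping non-key columns.
merged = pd.merge(s_data, data, how="inner", on="user_id",
                  suffixes=("_s", "_d"))
print(merged)
```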

awk match between two patterns in an "if/else" statement

I've got an awk issue that I can't seem to figure out. I'm trying to parse out data from SAR and found that some systems are using a different locale and I'm getting different output. The long term solution is to change the locale on all systems for the output data to the same thing, but I have to parse through old data for now and that is not currently an option. Here's the two types of data I get:
24-Hour Output:
21:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76
21:20:01 all 7.99 0.00 1.74 0.82 0.00 89.44
21:30:01 all 8.35 0.00 1.76 0.94 0.00 88.95
12-Hour Output:
09:10:01 PM all 8.43 0.00 1.81 2.00 0.00 87.76
09:20:01 PM all 7.99 0.00 1.74 0.82 0.00 89.44
09:30:01 PM all 8.35 0.00 1.76 0.94 0.00 88.95
I need an awk statement that will get items from 7AM-7PM for all SAR data. I originally had something working, but once I found this issue, it broke for all the 24-hour output. I tried getting the awk statement to work, but the following doesn't work and I can't figure out how to fix it:
awk '{ if ($2 == "AM" || $2 == "PM" && /07:00/,/07:00/) print $1" "$2; else '/07:00/,/19:00 print $1}' SAR_OUTPUT_FILE.txt
Basically, what I'm trying to do is: if it is in 24-hour format, search for 07:00-19:00 and return just the first column of output (since there is no "AM/PM" column). If it finds "AM/PM", I would consider that 12-hour format and want to get everything from 07:00-07:00 and return both the 1st and 2nd column (time + "AM/PM").
Can anyone help me out here?
Without access to an awk with time functions ( strftime() or mktime() ), you can shift the 12h end times so that they can be tested with the 24h time test.
Here's an awk executable that does that by adjusting the hours in the 12h formatted times to fit 24h time formats. The result is put into variable t for every line and is tested to be in the 24h range.
#!/usr/bin/awk -f
function timeShift( a, h ) {
    if(NF==9 && split($1, a, ":")==3) {
        if(a[1]==12) h = $2=="PM"?"12":"00"
        else if($2=="PM") h = (a[1]+12)%24
        else h = a[1]
        return( h ":" a[2] ":" a[3] )
    }
    return( $1 )
}
{ t = timeShift() }
t >= "07:00:00" && t <= "19:00:00"
If you need to print fewer fields than the full line, an action block could be added after the final expression.
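The same normalisation can be sketched in Python for comparison. This is an illustration and not a translation of the script above: it detects the 12-hour format by looking for AM/PM in the second field instead of counting fields, and the function names are made up.

```python
def to_24h(line):
    """Return the line's timestamp as a zero-padded 24-hour HH:MM:SS string."""
    fields = line.split()
    if not fields:
        return ""
    if len(fields) >= 2 and fields[1] in ("AM", "PM"):
        hh, mm, ss = fields[0].split(":")
        hour = int(hh) % 12          # 12 AM -> 0; 12 PM -> 0, then +12 below
        if fields[1] == "PM":
            hour += 12
        return "%02d:%s:%s" % (hour, mm, ss)
    return fields[0]                 # already 24-hour

def in_window(line, start="07:00:00", end="19:00:00"):
    # Zero-padded times compare correctly as plain strings.
    return start <= to_24h(line) <= end

lines = [
    "21:10:01 all 8.43 0.00 1.81 2.00 0.00 87.76",
    "09:10:01 PM all 8.43 0.00 1.81 2.00 0.00 87.76",
    "09:10:01 AM all 8.43 0.00 1.81 2.00 0.00 87.76",
]
kept = [l for l in lines if in_window(l)]
print(kept)
```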