Show Print Format in Jupyter Widgets - python-2.7

I have a result from classification_report in sklearn.metrics, and when I print the report it looks like this:
             precision    recall  f1-score   support
          1       1.00      0.84      0.91        43
          2       0.12      1.00      0.22         1
avg / total       0.98      0.84      0.90        44
Now, the question is how can I show the result in a Jupyter widget (in the above format) and update its value?
Currently, I am using html widgets to show the result:
# pass test and result vectors
report = classification_report(pred_test, self.y_test_data)
predict_table = widgets.HTML(value="")
predict_table.value = report
but it renders like the following:
precision recall f1-score support 1 1.00 0.81 0.90 43 2 0.00 0.00 0.00 0 avg / total 1.00 0.81 0.90 43

I found a simple solution using plain HTML! Since we are using an HTML widget in Jupyter, the problem can be solved with the pre tag, which preserves whitespace:
predict_table.value = "<pre>" + report + "</pre>"
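For reference, here is a minimal end-to-end sketch (ipywidgets assumed available; y_test, pred_test, and new_pred are placeholder label vectors standing in for the real data):
import ipywidgets as widgets
from IPython.display import display
from sklearn.metrics import classification_report

# classification_report expects (y_true, y_pred)
report = classification_report(y_test, pred_test)

# <pre> preserves the whitespace that HTML would otherwise collapse
predict_table = widgets.HTML(value="<pre>" + report + "</pre>")
display(predict_table)

# assigning to .value later refreshes the widget in place
predict_table.value = "<pre>" + classification_report(y_test, new_pred) + "</pre>"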


unclear pandas merge error

I have two dataframes like the ones below. I'm trying to merge them on the common field user_id. I've checked the syntax and I cannot see what the issue is. I'm running Python 2.7. Does anyone see the issue?
Code:
print s_data.columns
Index([u'user_id', u'bdn', u'preference_bdn'], dtype='object')
print data.columns
Index([u'user_id', u'bdn', u'preference_bdn'], dtype='object')
pd.merge[s_data, data, how='inner',left_on='user_id', right_on='user_id'].head()
Error:
File "<ipython-input-55-820f93556a69>", line 3
pd.merge[s_data, data how='inner',left_on='user_id', right_on='user_id'].head()
^
SyntaxError: invalid syntax
Data:
print s_data.head()
user_id bdn preference_bdn
0 4104910 vfs 0.95
1 4282779 vfs 1.00
2 5125665 MAIDE 0.65
3 5125665 SP 0.43
4 5125665 DK 0.11
print data.head()
user_id bdn preference_bdn
0 3949334 M 0.37
1 3949334 RAC. 0.37
2 3949334 B 0.19
3 3949334 TAY 0.19
4 4105144 AL 0.68
The SyntaxError comes from the missing comma between data and how in the line you actually ran. Note also that merge is a function, so it must be called with parentheses, pd.merge(...), not indexed with square brackets. Use this -
s_data.merge(data, how = 'inner', on ='user_id')
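Equivalently, with the module-level function called with parentheses (a quick sketch assuming both frames are already loaded):
import pandas as pd

merged = pd.merge(s_data, data, how='inner', on='user_id')
print merged.head()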

Merge Data From Different Files By Python

I have many datasets from files that need to be merged and arranged into one single output file. Here is an example with two of the datasets to be merged.
Data 1 from File 1:
9.00 2.80 13.08 12.78 0.73
10.00 -3.44 19.30 18.99 0.14
12.00 2.60 20.28 20.12 0.39
Data 2 from File 2:
2.00 -7.73 20.04 18.49 0.62
5.00 -4.82 17.07 16.38 0.59
6.00 -2.69 12.55 12.25 0.50
8.00 -3.85 18.06 17.64 0.94
9.00 -3.59 16.13 15.73 0.64
Expected output in one file:
9.00 2.80 13.08 12.78 0.73
10.00 -3.44 19.30 18.99 0.14
12.00 2.60 20.28 20.12 0.39
2.00 -7.73 20.04 18.49 0.62
5.00 -4.82 17.07 16.38 0.59
6.00 -2.69 12.55 12.25 0.50
8.00 -3.85 18.06 17.64 0.94
9.00 -3.59 16.13 15.73 0.64
For now, the script I use, based on a Python for loop, is this:
import numpy as np
import glob

path = './13-stat-plot-extreme-combine/'
files = glob.glob(path + '13-stat*.dat')
for x in range(len(files)):
    file1 = files[x]
    data1 = np.loadtxt(file1)
    np.savetxt("Combine-Stats.dat", data1, fmt='%9.2f')  # overwrites the file on every pass
The problem is that only one dataset ends up in the new file, because np.savetxt overwrites it on every pass through the loop. How can I use concatenate in such a case, stacking the datasets along an axis?
Like this:
arrays = [np.loadtxt(name) for name in files]
combined = np.concatenate(arrays)
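Putting that together with the original glob pattern, a minimal sketch that loads every file and writes one combined output:
import numpy as np
import glob

files = glob.glob('./13-stat-plot-extreme-combine/13-stat*.dat')
arrays = [np.loadtxt(name) for name in files]
combined = np.concatenate(arrays)  # stacks the blocks row-wise (axis=0)
np.savetxt('Combine-Stats.dat', combined, fmt='%9.2f')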

Python 2.7 Pandas: How to replace a for-loop?

I have a large pandas dataframe with 2000 rows (one date per row) and 2000 columns (1 second intervals). Each cell represents a temperature reading.
Starting with the 5th row, I need to go back 5 rows and find all the observations where the 1st column in the row is higher than the 2nd column in the row.
For the 5th row I may find 2 such observations. I then want to do summary stats on the observations and append those summary stats to a list.
Then I go to the 6th row, go back 5 rows, and find all the observations where the 1st column is higher than the 2nd column. I take all the observations, do summary stats on them, and append the results to the new dataframe.
So for each row in the dataframe, I want to go back 5 days, get the observations, get the stats, and append the stats to a dataframe.
The problem is that if I perform this operation on rows 5-2000, I will have a for-loop that is 1995 cycles long, and this takes a while.
What is the better or best way to do this?
Here is the code:
print huge_dataframe
sec_1 sec_2 sec_3 sec_4 sec_5
2013_12_27 0.05 0.12 0.06 0.15 0.14
2013_12_28 0.06 0.32 0.56 0.14 0.17
2013_12_29 0.07 0.52 0.36 0.13 0.13
2013_12_30 0.02 0.12 0.16 0.55 0.12
2013_12_31 0.06 0.30 0.06 0.14 0.01
2014_01_01 0.05 0.12 0.06 0.15 0.14
2014_01_02 0.06 0.32 0.56 0.14 0.17
2014_01_03 0.07 0.52 0.36 0.13 0.13
2014_01_04 0.02 0.12 0.16 0.55 0.12
2014_01_05 0.06 0.30 0.06 0.14 0.01
summary_list = []
for i in range(5, len(huge_dataframe)):
    row = huge_dataframe.iloc[i]
    move = row['sec_1'] - row['sec_2']
    # the 5 rows preceding the current row
    relevant_dataframe = huge_dataframe.iloc[i - 5:i]
    if move > 0:   # 'UP'
        mask = relevant_dataframe['sec_1'] < relevant_dataframe['sec_2']
    else:          # 'DOWN'
        mask = relevant_dataframe['sec_1'] > relevant_dataframe['sec_2']
    observations_df = relevant_dataframe[mask]
    # At this point observations_df holds only the rows where
    # sec_1 < sec_2 (or the opposite), depending on the direction
    # of the current row's move.
    summary_stats = str(observations_df.describe())
    summary_list.append(summary_stats)  # ultimately I want to turn this list into a dataframe
Since there is no code to create the data, I will just sketch the code that I would try to make work. Generally, try to avoid row-wise operations whenever you can. At first I had no clue either, but then I got interested and some research yielded TimeGrouper:
import pandas as pd

df = big_dataframe
df['move'] = df['sec_1'] > df['sec_2']

def foobarRules(group):
    # keep in mind that in here, you refer not to "relevant_dataframe", but to "group"
    if group['move'].iloc[-1]:  # did the last row of the group move up?
        pass  # some logic
    else:
        pass  # some other logic
    return str(group.describe())

grouper = pd.TimeGrouper('5D')  # requires a DatetimeIndex
allMyStatistics = df.groupby(grouper).apply(foobarRules)
I honestly have no clue how the return value works if you return a multi-dimensional dataframe. I know it works well if you return either a row or a column, but if you return a dataframe that contains both rows and columns for every group, I guess pandas is smart enough to combine all of these into a panel. Well, you will find out.
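To make the idea concrete, here is a self-contained sketch with invented data (TimeGrouper needs a DatetimeIndex, and note that '5D' forms consecutive 5-day bins rather than the trailing 5-row window from the question):
import numpy as np
import pandas as pd

# invented temperature data, one date per row
idx = pd.date_range('2013-12-27', periods=10, freq='D')
df = pd.DataFrame(np.random.rand(10, 2), index=idx, columns=['sec_1', 'sec_2'])
df['move'] = df['sec_1'] > df['sec_2']

def foobarRules(group):
    return str(group.describe())

# one describe() string per 5-day bin
stats = df.groupby(pd.TimeGrouper('5D')).apply(foobarRules)
print stats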

`gprof` time spent in particular lines of code

I've been using the gprof profiler in conjunction with g++.
I have a function in my code which encapsulates several sections of behaviour which are related enough to the primary function that it would not make sense to split them off into their own functions.
I'd like to know how much time is spent in each of these areas of code.
So, if you imagine the code looks like
function(){
A
A
A
B
B
B
C
C
C
}
where A, B, and C represent particular sections of code I'm interested in, is there a way to get gprof to tell me how much time is spent working on those particular sections?
I know it's an old question, but I have found an interesting answer.
As Sam says, the -l option only works with the old gcc compilers. But I have found that if you compile and link with -pg -fprofile-arcs -ftest-coverage and then run the program, the result of gprof -l is very interesting.
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
13.86 0.26 0.26 main (ComAnalyste.c:450 # 804b315)
10.87 0.46 0.20 main (ComAnalyste.c:386 # 804b151)
7.07 0.59 0.13 main (ComAnalyste.c:437 # 804b211)
6.25 0.70 0.12 main (ComAnalyste.c:436 # 804b425)
4.89 0.79 0.09 main (ComAnalyste.c:283 # 804a3f4)
4.89 0.88 0.09 main (ComAnalyste.c:436 # 804b1e9)
4.08 0.96 0.08 main (ComAnalyste.c:388 # 804ad95)
3.81 1.03 0.07 main (ComAnalyste.c:293 # 804a510)
3.53 1.09 0.07 main (ComAnalyste.c:401 # 804af04)
3.26 1.15 0.06 main (ComAnalyste.c:293 # 804a4bf)
2.72 1.20 0.05 main (ComAnalyste.c:278 # 804a48d)
2.72 1.25 0.05 main (ComAnalyste.c:389 # 804adae)
2.72 1.30 0.05 main (ComAnalyste.c:406 # 804aecb)
2.45 1.35 0.05 main (ComAnalyste.c:386 # 804ad6d)
2.45 1.39 0.05 main (ComAnalyste.c:443 # 804b248)
2.45 1.44 0.05 main (ComAnalyste.c:446 # 804b2f4)
2.17 1.48 0.04 main (ComAnalyste.c:294 # 804a4e4)
2.17 1.52 0.04 main (ComAnalyste.c:459 # 804b43b)
1.63 1.55 0.03 main (ComAnalyste.c:442 # 804b22d)
1.63 1.58 0.03 main (ComAnalyste.c:304 # 804a56d)
1.09 1.60 0.02 main (ComAnalyste.c:278 # 804a3b3)
1.09 1.62 0.02 main (ComAnalyste.c:285 # 804a450)
1.09 1.64 0.02 main (ComAnalyste.c:286 # 804a470)
1.09 1.66 0.02 main (ComAnalyste.c:302 # 804acdf)
0.82 1.67 0.02 main (ComAnalyste.c:435 # 804b1d2)
0.54 1.68 0.01 main (ComAnalyste.c:282 # 804a3db)
0.54 1.69 0.01 main (ComAnalyste.c:302 # 804a545)
0.54 1.70 0.01 main (ComAnalyste.c:307 # 804a586)
0.54 1.71 0.01 main (ComAnalyste.c:367 # 804ac1a)
0.54 1.72 0.01 main (ComAnalyste.c:395 # 804ade6)
0.54 1.73 0.01 main (ComAnalyste.c:411 # 804aff8)
0.54 1.74 0.01 main (ComAnalyste.c:425 # 804b12a)
0.54 1.75 0.01 main (ComAnalyste.c:429 # 804b19f)
0.54 1.76 0.01 main (ComAnalyste.c:444 # 804b26f)
0.54 1.77 0.01 main (ComAnalyste.c:464 # 804b4a1)
0.54 1.78 0.01 main (ComAnalyste.c:469 # 804b570)
0.54 1.79 0.01 main (ComAnalyste.c:472 # 804b5b9)
0.27 1.80 0.01 main (ComAnalyste.c:308 # 804a5a3)
0.27 1.80 0.01 main (ComAnalyste.c:309 # 804a5a9)
0.27 1.81 0.01 main (ComAnalyste.c:349 # 804a974)
0.27 1.81 0.01 main (ComAnalyste.c:350 # 804a99c)
0.27 1.82 0.01 main (ComAnalyste.c:402 # 804af1d)
0.27 1.82 0.01 main (ComAnalyste.c:416 # 804b073)
0.27 1.83 0.01 main (ComAnalyste.c:417 # 804b0a1)
0.27 1.83 0.01 main (ComAnalyste.c:454 # 804b3ec)
0.27 1.84 0.01 main (ComAnalyste.c:461 # 804b44a)
0.27 1.84 0.01 main (ComAnalyste.c:462 # 804b458)
It shows the time spent per line, which is a very interesting result.
I don't know how accurate or valid it is, but it is quite interesting.
Hope it helps.
Here's a useful resource for you: gprof line by line profiling.
With older versions of the gcc compiler, the gprof -l argument specified line by line profiling.
However, newer versions of gcc use the gcov tool instead of gprof to display line by line profiling information.
If you are using Linux, then you can use Linux perf instead of gprof, as described here:
http://code.google.com/p/jrfonseca/wiki/Gprof2Dot#linux_perf
Typing perf report and selecting a function will allow you to get line-by-line information about where the CPU time is spent inside the function.

How should I format my .dat file so that a 3D vector plot can be made?

I'm working on a programming task for college where we have to write a C++ program that calculates the magnetic field vector for certain coils in 3D space.
I've managed to write this program and I think I've got it working pretty well.
I want to add in a special thing though (it's my exam paper, so it has to be extra good!): I want to plot the vectors.
I'm used to calling gnuplot from C++ (via piping) and this is what I usually do:
create an output stream that writes the data to a .dat file
open a gnuplot pipe
make gnuplot plot all the contents of the .dat
Since my data has always been 2D (x and y plots), I'm quite lost here. My question is:
How to format the .dat file (e.g. do I use braces to group vector components?)
what is the actual gnuplot command to plot a 3D vector field?
It'd be easy if I could format the .dat file like this:
# Px Py Pz Bx By Bz
1 0 2 0.7 0.5 0.25 #<= example data line
... more data ...
where the magnetic field vector at the point P=(1,0,2) equals the vector B=(0.7,0.5,0.25). This would be easy to program; the real question is: will this do, and how do I plot it in gnuplot? (Wow, I've asked the same question 3 times, I guess.)
Piping to gnuplot
OK, since someone asked me to describe how I pipe (don't know if it's the right term though) stuff to gnuplot, here it is:
First open up a pipe and call it pipe:
FILE *pipe = popen("gnuplot -persist 2>/dev/null", "w");
Tell gnuplot what to do through the pipe:
fprintf(pipe, "set term x11 enhanced \n");
fprintf(pipe, "plot x^2 ti 'x^2' with lines\n");
notice the \n, which is absolutely necessary. It is what executes the command.
close the pipe:
pclose(pipe);
The necessary header is <cstdio> (stdio.h), which declares popen and pclose.
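For comparison, the same piping idea in Python (a sketch, assuming gnuplot is on the PATH):
import subprocess

# open a pipe to gnuplot, just like popen() in C++
pipe = subprocess.Popen(['gnuplot', '-persist'], stdin=subprocess.PIPE)
pipe.stdin.write("set term x11 enhanced\n")
pipe.stdin.write("plot x**2 ti 'x^2' with lines\n")  # the \n executes the command
pipe.stdin.close()  # closing the pipe ends the session
pipe.wait()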
I made this simple example to show you how to draw a vector field; the output is a 3D plot of arrows.
The data example I used to plot this was:
# Px Py Pz Bx By Bz
0 0 0 0.8 0.8 0.45
0 0 1 0.5 0.7 0.35
0 0 2 0.7 0.5 0.25
0 1 0 0.65 0.65 0.50
0 1 1 0.6 0.6 0.3
0 1 2 0.45 0.45 0.20
1 0 0 0.5 0.7 0.35
1 0 1 0.75 0.75 0.4
1 0 2 0.85 0.85 0.25
1 1 0 0.90 0.85 0.23
1 1 1 0.95 0.86 0.20
1 1 2 0.98 0.88 0.13
2 0 0 0.73 0.83 0.43
2 0 1 0.53 0.73 0.33
2 0 2 0.73 0.53 0.23
2 1 0 0.68 0.68 0.52
2 1 1 0.63 0.57 0.23
2 1 2 0.48 0.42 0.22
The command to plot it is:
gnuplot> splot "./data3d.dat" with vectors
Now you should read section 44, page 53 of the official manual (also available as a PDF). You may find this site also very useful.
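If it helps, here is a small Python sketch that writes a file in exactly this six-column Px Py Pz Bx By Bz layout (the field function is a made-up placeholder; substitute the real Biot-Savart computation):
import numpy as np

def field(p):
    # hypothetical stand-in for the real magnetic field computation
    return 0.5 * p / (np.linalg.norm(p) + 1.0)

with open('data3d.dat', 'w') as f:
    f.write('# Px Py Pz Bx By Bz\n')
    for x in range(3):
        for y in range(2):
            for z in range(3):
                b = field(np.array([x, y, z], dtype=float))
                f.write('%.2f %.2f %.2f %.2f %.2f %.2f\n' % (x, y, z, b[0], b[1], b[2]))
The resulting file can then be plotted directly with splot "./data3d.dat" with vectors.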
Edited:
This command doesn't fit your description of a mapping from (x,y,z) to (t,u,v). It actually draws each arrow from (X,Y,Z) to (X+dX,Y+dY,Z+dZ).
Cheers,
Beco