unclear pandas merge error - python-2.7

I have two dataframes like the ones below. I’m trying to merge them on the common field user_id. I’ve checked the syntax and I can not see what the issue is. I’m running python 2.7. Does anyone see the issue?
Code:
print s_data.columns
Index([u'user_id', u'bdn', u'preference_bdn'], dtype='object')
print data.columns
Index([u'user_id', u'bdn', u'preference_bdn'], dtype='object')
pd.merge[s_data, data, how='inner',left_on='user_id', right_on='user_id'].head()
Error:
File "<ipython-input-55-820f93556a69>", line 3
pd.merge[s_data, data how='inner',left_on='user_id', right_on='user_id'].head()
^
SyntaxError: invalid syntax
Data:
print s_data.head()
user_id bdn preference_bdn
0 4104910 vfs 0.95
1 4282779 vfs 1.00
2 5125665 MAIDE 0.65
3 5125665 SP 0.43
4 5125665 DK 0.11
print data.head()
user_id bdn preference_bdn
0 3949334 M 0.37
1 3949334 RAC. 0.37
2 3949334 B 0.19
3 3949334 TAY 0.19
4 4105144 AL 0.68

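There should be a comma between data and how. Also, pd.merge is a function, so it must be called with parentheses, not square brackets. Since the key has the same name in both dataframes, on='user_id' is enough:
s_data.merge(data, how='inner', on='user_id')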
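As a minimal sketch of what the corrected call returns, the snippet below merges two small frames. The rows are invented for illustration (the heads shown in the question share no user_id), and the suffixes argument is an optional addition that names the overlapping bdn and preference_bdn columns explicitly; without it pandas falls back to _x and _y.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
s_data = pd.DataFrame({
    "user_id": [4104910, 5125665],
    "bdn": ["vfs", "SP"],
    "preference_bdn": [0.95, 0.43],
})
data = pd.DataFrame({
    "user_id": [5125665, 3949334],
    "bdn": ["M", "B"],
    "preference_bdn": [0.37, 0.19],
})

# Inner join on the shared key; only user_id 5125665 appears in both frames.
merged = s_data.merge(data, how="inner", on="user_id", suffixes=("_s", "_d"))
print(merged)
```

Because both frames carry columns with the same names, the result keeps one user_id column and two suffixed copies of each of the other columns.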

Show Print Format in Jupyter Widgets

I have the output of classification_report from sklearn.metrics. When I print the report, it looks like this:
precision recall f1-score support
1 1.00 0.84 0.91 43
2 0.12 1.00 0.22 1
avg / total 0.98 0.84 0.90 44
Now, the question is how can I show the result in a Jupyter widget (in the above format) and update its value?
Currently, I am using html widgets to show the result:
#pass test and result vectors
report = classification_report(pred_test , self.y_test_data)
predict_table = widgets.HTML(value = "")
predict_table.value = report
but it is rendered like this:
precision recall f1-score support 1 1.00 0.81 0.90 43 2 0.00 0.00 0.00 0 avg / total 1.00 0.81 0.90 43
I found a simple solution using HTML! An HTML widget collapses runs of whitespace and line breaks, so the report has to be wrapped in a pre tag, which preserves them:
predict_table.value = "<pre>" + report + "</pre>"
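A small helper along these lines can keep the wrapping in one place. This is only a sketch: the function name report_to_html is made up here, and the call to html.escape (Python 3 standard library) is an addition not in the original answer, included so that any <, > or & in the text is not interpreted as markup.

```python
import html


def report_to_html(report):
    """Wrap plain text in <pre> so spaces and newlines survive in an HTML widget."""
    return "<pre>" + html.escape(report) + "</pre>"


# Stand-in text with the same layout as a classification report.
sample_report = (
    "             precision    recall  f1-score   support\n"
    "          1       1.00      0.84      0.91        43\n"
)
wrapped = report_to_html(sample_report)
print(wrapped)
```

With the widget from the question, the update would then read predict_table.value = report_to_html(report).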

Boxplots lose "box" nature when plotting weighted data

I have the following data in Stata:
input drug halflife hl_weight
3 2.95 0.0066
2 6.00 0.0004
5 13.60 0.0006
1 2.82 0.0331
4 8.80 0.0001
4 1.24 0.0075
2 6.25 0.1123
4 17.20 0.0002
5 14.50 0.0020
4 5.50 0.0016
5 13.30 0.0003
4 8.26 0.0201
4 16.50 0.0103
4 11.40 0.0016
4 5.90 0.0005
4 3.99 0.0100
4 2.80 0.0073
4 3.00 0.0133
4 3.17 0.0061
4 4.95 0.1404
end
I am trying to create boxplots of drug halflives using the command below:
graph box halflife [aweight=hl_weight], over(drug)
When I include the weight option, some of the resulting box plots consist of multiple dots instead of the typical interquartile range and median:
Why does this happen and how can I fix it?
This happens because of the weighting. When one or two observations carry most of the total weight within a drug group, the weighted quartiles collapse onto (nearly) the same value, so the box shrinks to a line and the remaining observations fall outside it and are drawn as dots.
I do not think there is anything to fix here. You could try the nooutsides option of the graph box command to hide the dots, but I would not recommend it.
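The effect can be checked by hand for drug 4, using the rows from the question. The sketch below uses a simple cumulative-weight definition of a weighted quantile (the smallest value at which the running weight reaches the requested share of the total); this is an assumption for illustration and may differ in detail from the rule Stata applies.

```python
def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight reaches q times the total weight."""
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= q * total:
            return value
    return pairs[-1][0]


# The 13 observations for drug 4, copied from the question's data.
halflife = [8.80, 1.24, 17.20, 5.50, 8.26, 16.50, 11.40,
            5.90, 3.99, 2.80, 3.00, 3.17, 4.95]
hl_weight = [0.0001, 0.0075, 0.0002, 0.0016, 0.0201, 0.0103, 0.0016,
             0.0005, 0.0100, 0.0073, 0.0133, 0.0061, 0.1404]

quartiles = [weighted_quantile(halflife, hl_weight, q) for q in (0.25, 0.50, 0.75)]
print(quartiles)  # [4.95, 4.95, 4.95]
```

The observation 4.95 holds about 64% of the group's weight, so all three quartiles land on it and the interquartile range is zero under this definition.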

Counting gradient using 2 columns array from external .dat file

I have a .dat file with 2 columns and between 14000 and 36000 rows, laid out like this:
0.00 0.00
2.00 1.00
2.03 1.01
2.05 1.07
.
.
.
79.03 23.01
The 1st column is extension, the 2nd is strain. To compute the gradient of the plot (to determine the Hooke's law region), I use the code below.
CCCCCC
Program gradient
REAL S(40000),E(40000),GRAD(40000,1)
open(unit=300, file='Probka1A.dat', status='OLD')
open(unit=321, file='result.out', status='unknown')
write(321,400)
400 format('alfa')
260 DO 200 i=1, 40000
read(300,30) S(i),E(i)
30 format(2F7.2)
GRAD(i,1)=(S(i)-S(i-1))/(E(i)-E(i-1))
write(321,777) GRAD(i,1)
777 Format(F7.2)
200 Continue
END
But when I run it, I get this error:
PGFIO-F-231/formatted read/unit=300/error on data conversion.
File name = Probka1A.dat formatted, sequential access record = 1
In source file gradient1.f, at line number 9
How can I compute the gradient, this way or another, in Fortran 77?
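There are several problems. The format 2F7.2 expects each number to occupy exactly seven columns, which does not match your file, and that is most likely the cause of the data conversion error on record 1; a list-directed read (*) accepts values separated by blanks. You are also reading from the file without checking for the end of the file, and at i=1 the code references S(0) and E(0), which are outside the arrays. Your loop should look like this (the exit label is 500 because 400 is already used by your format statement):
  260 DO 200 i=1, 40000
C     list-directed read; go to 500 on error or end of file
      read(300,*,ERR=500,END=500) S(i),E(i)
      if (i.GT.1) then
        GRAD(i-1,1)=(S(i)-S(i-1))/(E(i)-E(i-1))
        write(321,777) GRAD(i-1,1)
      end if
  777 Format(F7.2)
  200 Continue
  500 continue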
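For cross-checking the numbers written to result.out, the same finite difference can be sketched in a few lines of Python. This is not a replacement for the Fortran program; the function names are made up, and skipping pairs with a zero denominator is an assumption about how such rows should be handled.

```python
def parse_two_columns(text):
    """Parse blank-separated two-column text into a list of (s, e) float pairs."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            rows.append((float(fields[0]), float(fields[1])))
    return rows


def gradients(rows):
    """(S(i)-S(i-1)) / (E(i)-E(i-1)) for consecutive rows, skipping zero steps."""
    result = []
    for (s_prev, e_prev), (s_cur, e_cur) in zip(rows, rows[1:]):
        step = e_cur - e_prev
        if step == 0:
            continue  # assumption: ignore pairs that would divide by zero
        result.append((s_cur - s_prev) / step)
    return result


# First four rows of the sample file from the question.
sample = "0.00 0.00\n2.00 1.00\n2.03 1.01\n2.05 1.07\n"
grads = gradients(parse_two_columns(sample))
print(["%.2f" % g for g in grads])
```

Here the first column is assigned to S and the second to E, matching the order of the read statement in the Fortran code.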

Python 2.7 Pandas: How to replace a for-loop?

I have a large pandas dataframe with 2000 rows (one date per row) and 2000 columns (1 second intervals). Each cell represents a temperature reading.
Starting with the 5th row, I need to go back 5 rows and find all the observations where the 1st column in the row is higher than the 2nd column in the row.
For the 5th row I may find 2 such observations. I then want to compute summary stats on the observations and append those summary stats to a list.
Then I go to the 6th row, go back 5 rows, and find all the observations where the 1st column is higher than the 2nd column. I get all the observations, compute summary stats on them, and append the results to the new dataframe.
So for each row in the dataframe, I want to go back 5 days, get the observations, get the stats, and append the stats to a dataframe.
The problem is that if I perform this operation on rows 5-2000, then I will have a for-loop that is 1995 cycles long, and this takes a while.
What is the better or best way to do this?
Here is the code:
print huge_dataframe
sec_1 sec_2 sec_3 sec_4 sec_5
2013_12_27 0.05 0.12 0.06 0.15 0.14
2013_12_28 0.06 0.32 0.56 0.14 0.17
2013_12_29 0.07 0.52 0.36 0.13 0.13
2013_12_30 0.02 0.12 0.16 0.55 0.12
2013_12_31 0.06 0.30 0.06 0.14 0.01
2014_01_01 0.05 0.12 0.06 0.15 0.14
2014_01_02 0.06 0.32 0.56 0.14 0.17
2014_01_03 0.07 0.52 0.36 0.13 0.13
2014_01_04 0.02 0.12 0.16 0.55 0.12
2014_01_05 0.06 0.30 0.06 0.14 0.01
for each row in huge_dataframe.ix[5:]:
    move = row[sec_1] - row[sec_2]
    if move < 0: move = 'DOWN'
    elif move > 0: move = 'UP'
    relevant_dataframe = huge_dataframe.ix[only the 5 rows preceding the current row]
    if move == 'UP':
        mask = relevant_dataframe[sec_1 < sec_2] # creates a boolean dataframe
        observations_df = relevant_dataframe[mask]
    elif move == 'DOWN':
        mask = relevant_dataframe[sec_1 > sec_2] # creates a boolean dataframe
        observations_df = relevant_dataframe[mask]
    # At this point I have observations_df which is only filled
    # with rows where sec_1 < sec_2 or the opposite, depending on which
    # row I am in.
    summary_stats = str(observations_df.describe())
    summary_list.append(summary_stats) # This is the goal
                                       # I want to ultimately
                                       # turn the list into a
                                       # dataframe
Since there is no code to create the data, I will just sketch the code that I would try to make work. In general, try to avoid row-wise operations whenever you can. I had no clue at first either, but I got interested, and some research turned up TimeGrouper:
import pandas as pd
from pandas import TimeGrouper  # newer pandas versions: use pd.Grouper(freq='5D') instead
df = huge_dataframe
# the grouper needs a DatetimeIndex, so parse labels such as '2013_12_27'
df.index = pd.to_datetime(df.index, format='%Y_%m_%d')
df['move'] = df['sec_1'] > df['sec_2']
def foobarRules(group):
    # keep in mind that in here, you refer not to "relevant_dataframe", but to "group"
    if group['move'].iloc[-1]:
        pass  # some logic
    else:
        pass  # some other logic
    return str(group.describe())
grouper = TimeGrouper('5D')
allMyStatistics = df.groupby(grouper).apply(foobarRules)
I honestly have no clue how the return value works if you return a multi-dimensional dataframe. I know it works well if you return either a row or a column, but if you return a dataframe that contains both rows and columns for every group, I guess pandas is smart enough to combine them all into a panel. Well, you will find out.
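One caveat: a '5D' grouper forms consecutive, non-overlapping 5-day bins, whereas the question describes a window that slides forward one row at a time. A sliding version can be sketched with rolling, assuming pandas is installed. Everything below is illustrative: the function name, the choice of count and mean of sec_1 as the summary statistics, and the reading of "observation" as a row where sec_1 > sec_2 are assumptions, not the asker's exact logic.

```python
import pandas as pd


def rolling_up_stats(df, window=5):
    """For each row, summarise sec_1 over the preceding `window` rows
    in which sec_1 > sec_2 (no explicit Python loop over rows)."""
    up = df["sec_1"] > df["sec_2"]
    # Keep sec_1 only where the condition holds; other rows become NaN.
    sec_1_up = df["sec_1"].where(up)
    stats = pd.DataFrame({
        # Number of qualifying rows in each window.
        "count": up.astype(int).rolling(window).sum(),
        # Mean of sec_1 over the qualifying rows (NaN if there are none).
        "mean": sec_1_up.rolling(window, min_periods=1).mean(),
    })
    # shift(1) makes each row look at the rows before it, not including itself;
    # the first `window` rows have no complete look-back and are dropped.
    return stats.shift(1).iloc[window:]


# Tiny invented frame to exercise the function.
demo = pd.DataFrame({
    "sec_1": [1, 5, 2, 6, 3, 7, 0],
    "sec_2": [2, 1, 3, 1, 4, 1, 5],
})
print(rolling_up_stats(demo))
```

Other statistics (min, max, std) can be added as further columns in the same way; a full describe() per window would still need a per-window function.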

How should I format my .dat file so that a 3D vector plot can be made?

I'm working on a programming task for college where we have to write a C++ program that calculates the magnetic field vector for certain coils in 3D space.
I've managed to write this program and I think I've got it working pretty well.
I want to add a special thing, though (it's my exam paper, so it has to be extra good!): I want to plot the vectors.
I'm used to calling gnuplot from c++ (via piping) and this is what I usually do:
create an output stream that writes the data to a .dat file
open a gnuplot pipe
make gnuplot plot all the contents of the .dat
Since my data has always been 2D (x and y plots), I'm quite lost here. My questions are:
How do I format the .dat file (e.g. do I use braces to group vector components)?
What is the actual gnuplot command to plot a 3D vector field?
It'd be easy if I could format the .dat file like this:
# Px Py Pz Bx By Bz
1 0 2 0.7 0.5 0.25 #<= example data line
... more data ...
when the magnetic field vector at the point P=(1,0,2) equals B=(0.7,0.5,0.25). This would be easy to program; the real question is: will this do, and how do I plot it in gnuplot? (Wow, I guess I've asked the same question three times.)
Piping to gnuplot
OK, since someone asked me to describe how I pipe (I don't know if that's the right term, though) stuff to gnuplot, here it is:
First open up a pipe and call it pipe:
FILE *pipe = popen("gnuplot -persist 2>/dev/null", "w");
Tell gnuplot what to do through the pipe:
fprintf(pipe, "set term x11 enhanced \n");
fprintf(pipe, "plot x^2 ti 'x^2' with lines\n");
Notice the \n, which is absolutely necessary: it is what executes the command.
Close the pipe:
pclose(pipe);
The necessary header is <stdio.h> (<cstdio> in C++), which declares popen, fprintf and pclose on POSIX systems; <fstream> is not needed for this.
I made this simple example to show you how to draw a vector field. The output would be something like this pic:
The data example I used to plot this was:
# Px Py Pz Bx By Bz
0 0 0 0.8 0.8 0.45
0 0 1 0.5 0.7 0.35
0 0 2 0.7 0.5 0.25
0 1 0 0.65 0.65 0.50
0 1 1 0.6 0.6 0.3
0 1 2 0.45 0.45 0.20
1 0 0 0.5 0.7 0.35
1 0 1 0.75 0.75 0.4
1 0 2 0.85 0.85 0.25
1 1 0 0.90 0.85 0.23
1 1 1 0.95 0.86 0.20
1 1 2 0.98 0.88 0.13
2 0 0 0.73 0.83 0.43
2 0 1 0.53 0.73 0.33
2 0 2 0.73 0.53 0.23
2 1 0 0.68 0.68 0.52
2 1 1 0.63 0.57 0.23
2 1 2 0.48 0.42 0.22
The command to plot it is:
gnuplot> splot "./data3d.dat" with vectors
Now you should read the section 44, page 53 of the official manual (and here the pdf). You may find this site also very useful.
Edited:
This command doesn't fit your description of a mapping from (x,y,z) to (t,u,v). What it actually draws is an arrow from (X,Y,Z) to (X+dX,Y+dY,Z+dZ), so the last three columns are interpreted as displacements.
Cheers,
Beco
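The six-column layout used above is simple to generate from any language. As a rough sketch in Python (the question uses C++, so treat this as a description of the file format rather than of the asker's program), the helper below writes one "Px Py Pz Bx By Bz" line per point and builds the matching splot command. The function names and the file name data3d.dat are illustrative.

```python
def write_vector_field(path, samples):
    """Write (position, field) pairs as 'Px Py Pz Bx By Bz' lines for gnuplot."""
    with open(path, "w") as handle:
        handle.write("# Px Py Pz Bx By Bz\n")  # gnuplot ignores lines starting with '#'
        for position, field in samples:
            columns = list(position) + list(field)
            handle.write(" ".join("{:g}".format(c) for c in columns) + "\n")


def gnuplot_splot_command(path):
    """Command that draws each row as an arrow from P to P + B."""
    return 'splot "{}" using 1:2:3:4:5:6 with vectors\n'.format(path)


if __name__ == "__main__":
    samples = [((0, 0, 0), (0.8, 0.8, 0.45)), ((1, 0, 2), (0.7, 0.5, 0.25))]
    write_vector_field("data3d.dat", samples)
    print(gnuplot_splot_command("data3d.dat"))
```

The returned command string ends with a newline so it can be sent through a gnuplot pipe as is, in the same way as the fprintf calls in the question.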