I am reading data from this link: http://www.weerindelft.nl/clientraw.txt.
The main goal is to print the temperature that http://www.weerindelft.nl displays. I have discovered that it is in that text file, so I only need to print the right part of the file.
This is my code:
import socket
from decimal import Decimal
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("www.weerindelft.nl" , 80))
s.sendall("GET http://www.weerindelft.nl/clientraw.txt HTTP/1.0\n\n")
write = s.recv(1427247693)
variable_1 = str(write[311:])
integer = float(variable_1[46:50])
tim = round(integer,0)
print Decimal(tim)
f = open("output.txt", "w")
f.write(write)
f.close()
s.close()
This is my output:
HTTP/1.1 200 OK
Date: Wed, 04 Jan 2017 12:34:14 GMT
Server: Apache
Last-Modified: Wed, 04 Jan 2017 12:34:12 GMT
Vary: Accept-Encoding
Content-Type: text/plain
X-Varnish: 110069959 109349321
Age: 32
Via: 1.1 varnish (Varnish/5.0)
ETag: W/"b173bdaf-2fb-54544008156e2"
Accept-Ranges: bytes
Content-Length: 763
Connection: close
12345 7.0 7.8 318 5.4 85 1016.9 1.0 4.2 4.2 0.014 0.086 18.7 38 100.0 34 0.0 0 0 0.2 -100.0 255.0 -100.0 -100.0 -100.0 -100.0 -100 -100 -100 13 20 58 WeerinDelft-13:20:58 2 100 4 1 100 100 100 100 100 100 100 2.6 4.0 8.0 5.1 34 zonnig/Gestopt_met_regenen 0.2 4 4 4 7 5 5 8 6 6 5 6 6 4 4 4 4 5 6 9 8 30.4 3.0 949.9 4/1/2017 7.5 3.6 6.0 0.9 0.5 14 12 10 12 7 11 8 5 6 10 6.8 6.9 6.7 6.5 6.4 6.5 5.7 5.3 5.1 5.3 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1.0 1.0 1.0 8.0 5.1 5.4 18.2 0 13:09:41 2017/04/01 326 522 91 -100.0 -100.0 5 0 0 0 0 102.0 18.9 18.7 4.7 1017.2 1014.5 24 12:40 10:35 6.1 0.8 6.2 1.5 15 2017 -13.9 -1 1 -1 341 336 336 309 331 358 336 318 310 318 10.0 255.0 7.5 4.4 51.97944 -4.34139 0.6 90 66 1.0 10:46 0.0 0.0 0.0 0.0 0.0 0.0 249.8 05:47 13:11 !!C10.37S13!!
I have used requests before and it worked like a charm. Unfortunately, the assignment requires the socket module. I think I know where the problem lies, but not how to solve it. I need to strip the HTTP status line and headers so that I can read just the file contents and print the right part. At the moment the script only succeeds some of the time, because the text shifts position and my script relies on a fixed slice:
integer = float(variable_1[46:50])
This part of the text file/string.
I hope you understand what I mean. My apologies in advance if this post has some flaws. It's my first one and I am fairly new to programming.
Thanks in advance.
An HTTP response separates the headers from the content with a blank line, that is, the sequence '\r\n\r\n'.
So you can use
write.split('\r\n\r\n', 1)[1]
to drop the HTTP status line and headers and keep only the content of the response.
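Putting that together, here is a minimal sketch of the whole flow. The helper names fetch_raw, split_body and field are made up for illustration, and the loop around recv is an addition of mine: a single recv call is not guaranteed to return the whole response, which may be another reason the original script only works some of the time.

```python
import socket


def fetch_raw(host, path, port=80):
    """Return the complete raw HTTP response (headers and body) as bytes."""
    request = "GET {0} HTTP/1.0\r\nHost: {1}\r\n\r\n".format(path, host)
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            # recv may return fewer bytes than requested, so keep reading
            # until the server closes the connection (empty result).
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        sock.close()
    return b"".join(chunks)


def split_body(raw):
    """Drop the status line and headers; keep what follows the blank line."""
    _headers, _sep, body = raw.partition(b"\r\n\r\n")
    return body


def field(body, index):
    """Return one whitespace-separated field of the body as a float."""
    return float(body.decode("ascii", "replace").split()[index])
```

With network access, field(split_body(fetch_raw("www.weerindelft.nl", "/clientraw.txt")), 4) would return the fifth value in the file. Treating index 4 as the temperature is an assumption on my part; check it against what the site displays. Picking a field by position after splitting on whitespace avoids the fixed character offsets that break when the text shifts.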
I am trying to scrape time series data into a pandas DataFrame with Python 2.7 from the web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm). Could somebody please help me write the code? Thanks!
I tried my code as follows:
html = urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm")
text = html.read()
df=pd.DataFrame(index=datum, columns=['m_ta','m_tax','m_taxd', 'm_tan','m_tand'])
But it doesn't produce anything. I want to display the table as it appears on the page.
You can use BeautifulSoup to parse all font tags, then split column a on whitespace, call set_index on column idx, and use rename_axis(None) to remove the index name:
import pandas as pd
import urllib
from bs4 import BeautifulSoup
html = urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm")
soup = BeautifulSoup(html)
#print soup
fontTags = soup.findAll('font')
#print fontTags
#get the text from the font tags
li = [x.text for x in fontTags]
#skip the first 13 tags, which do not contain the data
df = pd.DataFrame(li[13:], columns=['a'])
#split the data on arbitrary whitespace
df = df.a.str.split(r'\s+', expand=True)
#set column names
df.columns = ['idx','m_ta','m_tax','m_taxd', 'm_tan','m_tand']
#convert column idx to period
df['idx'] = pd.to_datetime(df['idx']).dt.to_period('M')
#convert columns to datetime
df['m_taxd'] = pd.to_datetime(df['m_taxd'])
df['m_tand'] = pd.to_datetime(df['m_tand'])
#set column idx to index, remove index name
df = df.set_index('idx').rename_axis(None)
print df
m_ta m_tax m_taxd m_tan m_tand
1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
1902-01 3.4 7.5 1902-01-25 -2.2 1902-01-15
1902-02 2.8 6.6 1902-02-09 -2.8 1902-02-06
1902-03 5.3 13.3 1902-03-22 -3.5 1902-03-13
1902-04 10.5 15.8 1902-04-21 6.1 1902-04-08
1902-05 12.5 20.6 1902-05-31 8.5 1902-05-10
1902-06 18.5 23.8 1902-06-30 14.4 1902-06-19
1902-07 20.2 25.2 1902-07-01 15.5 1902-07-03
1902-08 21.1 25.4 1902-08-07 14.7 1902-08-13
1902-09 16.1 23.8 1902-09-05 9.5 1902-09-24
1902-10 10.8 15.4 1902-10-12 4.9 1902-10-25
1902-11 2.4 9.1 1902-11-01 -4.2 1902-11-18
1902-12 -3.1 7.2 1902-12-27 -17.6 1902-12-15
1903-01 -0.5 8.3 1903-01-11 -11.5 1903-01-23
1903-02 4.6 13.4 1903-02-23 -2.7 1903-02-17
1903-03 9.0 16.1 1903-03-28 4.9 1903-03-09
1903-04 9.0 16.5 1903-04-29 2.6 1903-04-19
1903-05 16.4 21.2 1903-05-03 11.3 1903-05-19
1903-06 19.0 23.1 1903-06-03 15.6 1903-06-07
... ... ... ... ... ...
1998-07 22.5 30.7 1998-07-23 15.0 1998-07-09
1998-08 22.3 30.5 1998-08-03 14.8 1998-08-29
1998-09 16.0 21.0 1998-09-12 10.4 1998-09-14
1998-10 11.9 17.2 1998-10-07 8.2 1998-10-27
1998-11 3.8 8.4 1998-11-05 -1.6 1998-11-21
1998-12 -1.6 6.2 1998-12-14 -8.2 1998-12-26
1999-01 0.6 4.7 1999-01-15 -4.8 1999-01-31
1999-02 1.5 6.9 1999-02-05 -4.8 1999-02-01
1999-03 8.2 15.5 1999-03-31 3.0 1999-03-16
1999-04 13.1 17.1 1999-04-16 6.1 1999-04-18
1999-05 17.2 25.2 1999-05-31 11.1 1999-05-06
1999-06 19.8 24.4 1999-06-07 12.2 1999-06-22
1999-07 22.3 28.0 1999-07-06 16.3 1999-07-23
1999-08 20.6 26.7 1999-08-09 17.3 1999-08-23
1999-09 19.3 22.9 1999-09-26 15.0 1999-09-02
1999-10 11.5 19.0 1999-10-03 5.7 1999-10-18
1999-11 3.9 12.6 1999-11-04 -2.2 1999-11-21
1999-12 1.3 6.4 1999-12-13 -8.1 1999-12-25
2000-01 -0.7 8.7 2000-01-31 -6.6 2000-01-25
2000-02 4.5 10.2 2000-02-01 -0.1 2000-02-23
2000-03 6.7 11.6 2000-03-09 0.6 2000-03-17
2000-04 14.8 22.1 2000-04-21 5.8 2000-04-09
2000-05 18.7 23.9 2000-05-27 12.3 2000-05-22
2000-06 21.9 29.3 2000-06-14 15.4 2000-06-17
2000-07 20.3 26.6 2000-07-03 14.0 2000-07-16
2000-08 23.8 29.7 2000-08-20 18.5 2000-08-31
2000-09 16.1 21.5 2000-09-14 12.7 2000-09-24
2000-10 14.1 18.7 2000-10-04 8.0 2000-10-23
2000-11 9.0 14.9 2000-11-15 3.7 2000-11-30
2000-12 3.0 9.4 2000-12-14 -6.8 2000-12-24
[1200 rows x 5 columns]
I am trying to select the following data using pandas for Python 2.7 from the web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm), covering the years 1991 to 2000. Could somebody please help me write the code? Thanks!
datum m_ta m_tax m_taxd m_tan m_tand
------- ----- ----- ---------- ----- ----------
1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
1902-01 3.4 7.5 1902-01-25 -2.2 1902-01-15
1902-02 2.8 6.6 1902-02-09 -2.8 1902-02-06
1902-03 5.3 13.3 1902-03-22 -3.5 1902-03-13
1902-04 10.5 15.8 1902-04-21 6.1 1902-04-08
1902-05 12.5 20.6 1902-05-31 8.5 1902-05-10
1902-06 18.5 23.8 1902-06-30 14.4 1902-06-19
....
You can use dt.year with boolean indexing to select rows by the datum column:
#convert column datum to period
df['datum'] = pd.to_datetime(df['datum']).dt.to_period('M')
#convert columns to datetime
df['m_taxd'] = pd.to_datetime(df['m_taxd'])
df['m_tand'] = pd.to_datetime(df['m_tand'])
print df.datum.dt.year
0 1901
1 1901
2 1901
3 1901
4 1901
5 1901
6 1901
7 1901
8 1901
9 1901
10 1901
11 1901
12 1902
13 1902
14 1902
15 1902
16 1902
17 1902
Name: datum, dtype: int64
#change 1901 to 2000 to keep all years up to and including 2000
print df[df.datum.dt.year <= 1901]
datum m_ta m_tax m_taxd m_tan m_tand
0 1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1 1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
2 1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
3 1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
4 1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
5 1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
6 1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
7 1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
8 1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
9 1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
10 1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
11 1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
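Since the question asks for 1991 to 2000, both a lower and an upper bound are needed. A small sketch of that, using a handful of rows from this station's monthly table and only two columns for brevity (the variable names years and selected are mine):

```python
import pandas as pd

# A few rows from the monthly table; only datum and m_ta are kept here.
df = pd.DataFrame({
    "datum": ["1901-01", "1903-06", "1998-07", "1999-01", "2000-12"],
    "m_ta": [-4.7, 19.0, 22.5, 0.6, 3.0],
})
# Convert datum to monthly periods, as in the answer above.
df["datum"] = pd.to_datetime(df["datum"], format="%Y-%m").dt.to_period("M")

# Combine the two conditions with & and wrap each one in parentheses.
years = df["datum"].dt.year
selected = df[(years >= 1991) & (years <= 2000)]
print(selected)
```

Only the 1998, 1999 and 2000 rows remain. The parentheses matter because & binds more tightly than the comparison operators.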
I have two data frames as such:
df1 = pd.DataFrame({ 'pressure' : [42,42,42,42,42,42,42,36,36,36,36,36,36,36],
'load' : [350,350,350,350,350,350,350,700,700,700,700,700,700,700],
'speed' : [70,60,50,40,30,20,10,70,60,50,40,30,20,10],
'lforce' : [3.6,3.5,3.3,3.2,3.1,3.1,2.9,7.7,7.3,7.0,6.8,6.5,6.4,6.1],
'rforce' : [3.4,3.2,3.1,3.0,2.9,2.8,2.7,7.6,7.2,6.9,6.6,6.3,6.2,5.9]
}).set_index(['pressure','load','speed'])
df2 = pd.DataFrame({ 'pressure' : [47,47,47,47,47,47,47],
'load' : [20,20,20,20,20,20,20],
'speed' : [70,60,50,40,30,20,10],
'lforce' : [2.5,2.1,1.9,1.7,1.5,1.3,1.2],
'rforce' : [2.8,2.6,2.4,2.2,2.0,1.8,1.7]
}).set_index(['pressure','load','speed'])
Formatted:
>>> df1
lforce rforce
pressure load speed
42 350 70 3.6 3.4
60 3.5 3.2
50 3.3 3.1
40 3.2 3.0
30 3.1 2.9
20 3.1 2.8
10 2.9 2.7
36 700 70 7.7 7.6
60 7.3 7.2
50 7.0 6.9
40 6.8 6.6
30 6.5 6.3
20 6.4 6.2
10 6.1 5.9
>>> df2
lforce rforce
pressure load speed
47 20 70 2.5 2.8
60 2.1 2.6
50 1.9 2.4
40 1.7 2.2
30 1.5 2.0
20 1.3 1.8
10 1.2 1.7
I would like to subtract df2 from df1 on the lforce and rforce columns for each speed to get the resulting data frame df3.
My problem is that I need to ignore the pressure and load in df2 during the subtraction, but retain the originals from df1.
Desired result:
>>> df3
lforce rforce
pressure load speed
42 350 70 1.1 0.6
60 1.4 0.6
50 1.4 0.7
40 1.5 0.8
30 1.6 0.9
20 1.8 1.0
10 1.7 1.0
36 700 70 5.2 4.8
60 5.2 4.6
50 5.1 4.5
40 5.1 4.4
30 5.0 4.3
20 5.1 4.4
10 4.9 4.2
df1.sub(df2.reset_index([0, 1], drop=True), level=2)
output:
lforce rforce
pressure load speed
42 350 70 1.1 0.6
60 1.4 0.6
50 1.4 0.7
40 1.5 0.8
30 1.6 0.9
20 1.8 1.0
10 1.7 1.0
36 700 70 5.2 4.8
60 5.2 4.6
50 5.1 4.5
40 5.1 4.4
30 5.0 4.3
20 5.1 4.4
10 4.9 4.2
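To spell out what that one-liner does, here is a cut-down sketch with only two speeds per group. It uses level names instead of positions, which I find easier to read; by_speed is a name I made up, and the round(1) at the end is only there to tidy the floating-point output.

```python
import pandas as pd

df1 = pd.DataFrame({
    "pressure": [42, 42, 36, 36],
    "load": [350, 350, 700, 700],
    "speed": [70, 60, 70, 60],
    "lforce": [3.6, 3.5, 7.7, 7.3],
    "rforce": [3.4, 3.2, 7.6, 7.2],
}).set_index(["pressure", "load", "speed"])
df2 = pd.DataFrame({
    "pressure": [47, 47],
    "load": [20, 20],
    "speed": [70, 60],
    "lforce": [2.5, 2.1],
    "rforce": [2.8, 2.6],
}).set_index(["pressure", "load", "speed"])

# Drop pressure and load from df2 so that only speed remains as its index.
by_speed = df2.reset_index(["pressure", "load"], drop=True)

# level="speed" matches by_speed's flat index against the speed level of
# df1's MultiIndex; df1's pressure and load labels are kept as they are.
df3 = df1.sub(by_speed, level="speed").round(1)
print(df3)
```

Each df2 row is reused for every (pressure, load) group in df1 that has the same speed, which is why the pressure and load of df2 never enter the result.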
Maybe something like this:
>>> df3 = df1.reset_index(level=[0,1])
>>> df4 = df2.reset_index(level=[0,1])
>>> df4['pressure'] = 0
>>> df4['load'] = 0
>>> df3 - df4
pressure load lforce rforce
speed
10 42 350 1.7 1.0
10 36 700 4.9 4.2
20 42 350 1.8 1.0
20 36 700 5.1 4.4
30 42 350 1.6 0.9
30 36 700 5.0 4.3
40 42 350 1.5 0.8
40 36 700 5.1 4.4
50 42 350 1.4 0.7
50 36 700 5.1 4.5
60 42 350 1.4 0.6
60 36 700 5.2 4.6
70 42 350 1.1 0.6
70 36 700 5.2 4.8
Now you just have to move pressure and load back into the index.
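That last step could look roughly like this. It is a sketch on the first four rows of the result above; result and restored are names I picked for illustration.

```python
import pandas as pd

# First four rows of the subtraction result shown above.
result = pd.DataFrame(
    {
        "pressure": [42, 36, 42, 36],
        "load": [350, 700, 350, 700],
        "lforce": [1.7, 4.9, 1.8, 5.1],
        "rforce": [1.0, 4.2, 1.0, 4.4],
    },
    index=pd.Index([10, 10, 20, 20], name="speed"),
)

# Append pressure and load to the existing speed index, then put the
# levels back into the original pressure/load/speed order.
restored = (
    result.set_index(["pressure", "load"], append=True)
    .reorder_levels(["pressure", "load", "speed"])
    .sort_index()
)
print(restored)
```

Note that sort_index orders the rows ascending by pressure, load and speed, which is not the same row order as in the original df1.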
Is this what you're looking for?
d1 = df1.reset_index(['pressure','load'])
d2 = df2.reset_index(['pressure','load'])
r0 = d1.merge(d2, left_index=True, right_index=True)
r1 = r0.set_index(['pressure_x','load_x'], drop=False)
r1['lforce'] = r1.lforce_x - r1.lforce_y
r1['rforce'] = r1.rforce_x - r1.rforce_y
df3 = r1[['lforce','rforce']]
df3
One:
data have;
input x1 x2;
diff=x1-x2;
a_diff= round(abs(diff), .01);
* a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
Results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.5
6 2.9 4.9 -2.0 2.0 5.5
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Two:
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.0
6 2.9 4.9 -2.0 2.0 6.0
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Please look at Obs 3, 9, 5 and 6: why are the ranks different between the two runs? Thank you!
Run the code below and you'll see that the a_diff values are actually different. That's because of inaccuracies in numeric storage. Just as 1/3 cannot be represented exactly in decimal notation (0.333333333333333 etc.), so that 1-(1/3)-(1/3)-(1/3) is not equal to zero if you use, say, ten digits to store each result as you go (it comes out as 0.0000000001), any computer system has trouble with certain numbers that look tidy in decimal (base 10) but cannot be stored exactly in binary.
The solution here is basically to round, as you already do in the first version, or to fuzz the result, which amounts to the same thing (fuzz ignores differences smaller than 1x10^-12).
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
put a_diff= hex16.;
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
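The same effect can be reproduced outside SAS. The following Python sketch assumes that SAS stores numerics as IEEE 754 doubles, as Python does; it compares the two absolute differences from Obs 5 and Obs 6. The math.isclose call is only a rough analogue of fuzzing, not the SAS function itself.

```python
import math

a = abs(25.5 - 27.5)  # Obs 5: both inputs are exact in binary
b = abs(2.9 - 4.9)    # Obs 6: neither input is exact in binary

print(a == b)            # False: the stored values differ in the last bit
print(a.hex(), b.hex())  # hexadecimal view of the stored doubles

# Rounding to two decimals, as the first program does, makes them tie.
print(round(a, 2) == round(b, 2))  # True

# Comparing with a small absolute tolerance has the same effect.
print(math.isclose(a, b, abs_tol=1e-12))  # True
```

Obs 3 and Obs 9 stay tied in both runs because 46.2 and 43.2 carry the same representation error, so it cancels out and their difference is exactly 3.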