Python urllib2 - Freezes when connection temporarily dies - python-2.7

So, I'm working with urllib2, and it keeps freezing on a specific page. Not even Ctrl-C will cancel the operation. It throws no errors (I'm catching everything), and I can't figure out how to break out of it. Is there a timeout option for urllib2 that defaults to never timing out?
Here's the procedure:
req = urllib2.Request(url,headers={'User-Agent':'...<chrome's user agent string>...'})
page = urllib2.urlopen(req)
# p.s. I'm not installing any openers
Then, if the internet connection drops partway through the second line (which performs the download), the program freezes completely, even after the connection is restored.
Here's the response header I get in my browser (Chrome) from the same page:
HTTP/1.1 200 OK
Date: Wed, 15 Feb 2017 18:12:12 GMT
Content-Type: application/rss+xml; charset=UTF-8
Content-Length: 247377
Connection: keep-alive
ETag: "00e0dd2d7cab7cffeca0b46775e1be7e"
X-Robots-Tag: noindex, follow
Link: ; rel="https://api.w.org/"
Content-Encoding: gzip
Vary: Accept-Encoding
Cache-Control: max-age=600, private, must-revalidate
Expires: Wed, 15 Feb 2017 18:12:07 GMT
X-Cacheable: NO:Not Cacheable
Accept-Ranges: bytes
X-Served-From-Cache: Yes
Server: cloudflare-nginx
CF-RAY: 331ab9e1443656d5-IAD
p.s. The URL points to a large WordPress feed which, according to the response headers, is served gzip-compressed.

According to the docs, the default timeout is indeed no timeout. You can specify a timeout when calling urlopen, though. :)
page = urllib2.urlopen(req, timeout=30)
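A slightly fuller sketch with error handling (the URL is a placeholder): in Python 2.7 a timeout during the connect phase surfaces as a urllib2.URLError whose reason is a socket.timeout, while a stall in the middle of a read raises socket.timeout directly.
import socket
import urllib2

url = 'http://example.com/feed'  # placeholder
req = urllib2.Request(url, headers={'User-Agent': '...'})
try:
    # The timeout (in seconds) applies to each blocking socket operation,
    # including reads, so a mid-download stall no longer hangs forever.
    page = urllib2.urlopen(req, timeout=30)
    data = page.read()
except urllib2.URLError as e:
    print 'request failed:', e.reason
except socket.timeout:
    print 'read timed out mid-download'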

Related

BigQuery upload job returning errors - payload parts count wrong?

We are experiencing upload errors to BigQuery / cloud storage:
REQUEST
POST https://www.googleapis.com/upload/bigquery/v2/projects/XXX HTTP/1.1
Content-Type: multipart/related; boundary="PART_TAG_DATA_IMPORTER"
Host: www.googleapis.com
Content-Length: 652
--PART_TAG_DATA_IMPORTER
Content-Type: application/json; charset=UTF-8
{"configuration":{"load":{"createDisposition":"CREATE_IF_NEEDED","destinationTable":{"datasetId":"XX","projectId":"XX","tableId":"XX"},"schema":{"fields":[{"mode":"required","name":"xx1","type":"INTEGER"},{"mode":"required","name":"xx2","type":"STRING"},{"mode":"required","name":"xx3","type":"INTEGER"}]},"skipLeadingRows":1,"sourceFormat":"CSV","sourceUris":["gs://XXX/9f41d369-b63e-4858-9108-7d1243175955.csv"],"writeDisposition":"WRITE_TRUNCATE"}}}
--PART_TAG_DATA_IMPORTER--
RESPONSE:
HTTP/1.1 400 Bad Request
X-GUploader-UploadID: XXX
Content-Length: 77
Date: Fri, 15 Nov 2019 10:23:33 GMT
Server: UploadServer
Content-Type: text/html; charset=UTF-8
Alt-Svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000
Payload parts count different from expected 2. Request payload parts count: 1
Is anyone else seeing this? Everything worked fine until last night: there were no changes in our codebase, yet the error happens in about 80% of cases, though after 5-6 attempts a request (sometimes) goes through.
We are using .NET with the latest Google.Apis libraries, but this is reproducible with a plain request to the server. It also sometimes goes through normally.
It appears Google added a check to the /upload/bigquery/v2/projects/{projectId}/jobs endpoint so that it no longer accepts a single-part multipart message.
The plain /bigquery/v2/projects/{projectId}/jobs endpoint should be used when loading from GCS, per this documentation (which does not say so explicitly):
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/insert
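For illustration, here is a minimal sketch of the corrected call against the plain jobs endpoint (Python 2 here; the XX/XXX placeholders mirror the redacted values above, and the access token is assumed to be a valid OAuth 2.0 bearer token):
import json
import urllib2

project_id = 'XXX'     # placeholder, as in the question
access_token = 'XXX'   # assumed valid OAuth 2.0 token

job = {'configuration': {'load': {
    'createDisposition': 'CREATE_IF_NEEDED',
    'destinationTable': {'datasetId': 'XX', 'projectId': 'XX', 'tableId': 'XX'},
    'schema': {'fields': [
        {'mode': 'required', 'name': 'xx1', 'type': 'INTEGER'},
        {'mode': 'required', 'name': 'xx2', 'type': 'STRING'},
        {'mode': 'required', 'name': 'xx3', 'type': 'INTEGER'},
    ]},
    'skipLeadingRows': 1,
    'sourceFormat': 'CSV',
    'sourceUris': ['gs://XXX/9f41d369-b63e-4858-9108-7d1243175955.csv'],
    'writeDisposition': 'WRITE_TRUNCATE',
}}}

# Note: /bigquery/v2/..., not /upload/bigquery/v2/... -- no multipart body is
# needed when the source data already lives in GCS.
req = urllib2.Request(
    'https://www.googleapis.com/bigquery/v2/projects/%s/jobs' % project_id,
    data=json.dumps(job),
    headers={'Authorization': 'Bearer %s' % access_token,
             'Content-Type': 'application/json'})
print urllib2.urlopen(req).read()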
This looks quite odd: you appear to be using the inline upload endpoint while passing a reference to a GCS object in the load config, rather than sending inline data.
Could you share a snippet of how you're constructing this request from your .NET code?

PowerBI live dashboard not updating in real time from REST API

I've got a small, simple console app pushing data into a PowerBI dataset. The data is going in, but the dashboard does not appear to update in real time.
If I manually refresh the dashboard I can see the latest data, but it does not update automatically when I add rows to the table.
I've got a fiddler output of the request/response so I can see data is going across.
POST https://api.powerbi.com/v1.0/myorg/datasets/e6373821-c2ed-438a-967a-febe163dca75/tables/LiveCpu/rows HTTP/1.1
Connection: Keep-Alive
Authorization: Bearer xxxx
Content-Type: application/json; charset=utf-8
Host: api.powerbi.com
Content-Length: 65
Expect: 100-continue
{"rows":[{"Timestamp":"2016-04-29T11:49:01","Value":31.8878784}]}
The response back is
HTTP/1.1 200 OK
Cache-Control: no-store, must-revalidate, no-cache
Transfer-Encoding: chunked
Content-Type: application/octet-stream
Server: Microsoft-HTTPAPI/2.0,Microsoft-HTTPAPI/2.0 Microsoft-HTTPAPI/2.0
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Frame-Options: deny
X-Content-Type-Options: nosniff
RequestId: 9daaabb9-e76d-4684-8ed3-1f6dc37889ab
Date: Fri, 29 Apr 2016 10:48:59 GMT
0
So everything looks OK, but the live dashboard is not updating. I can even see messages in the browser developer tools showing the request ID went through, but there are no live updates.
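For reference, the push itself is reproducible with a few lines of Python 2 (dataset and table IDs from the capture above; the bearer token is a placeholder):
import json
import urllib2

token = 'xxxx'  # placeholder bearer token
url = ('https://api.powerbi.com/v1.0/myorg/datasets/'
       'e6373821-c2ed-438a-967a-febe163dca75/tables/LiveCpu/rows')
body = json.dumps({'rows': [{'Timestamp': '2016-04-29T11:49:01',
                             'Value': 31.8878784}]})
req = urllib2.Request(url, data=body, headers={
    'Authorization': 'Bearer %s' % token,
    'Content-Type': 'application/json'})
print urllib2.urlopen(req).getcode()  # expect 200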
It appears the problem was that I pinned an entire report to the dashboard rather than an individual report tile. Tiles pinned as part of a whole report do not appear to support automatic refresh; individually pinned tiles do.

MISS from Cloudfront after HIT from Cloudfront

I am switching to Amazon CloudFront for serving images on my website. To reduce load when we finally make it live, I thought of warming up the cache by hitting the image URLs in advance (I am making these requests from India and expect the majority of users to request from the same region, so there is no need to have a copy of the object on every edge location worldwide).
The problem is that the script uses curl to request the images, and when I access the same URL in a browser I get a MISS from CloudFront. So CloudFront is keeping two copies of the object for these two requests.
My current CloudFront configuration forwards the Content-Type request header to the origin.
How should I configure CloudFront so that it ignores request headers entirely, and once I have made a request (whether via curl or a browser) it serves all future requests for the same resource from the edge and not the origin?
Request/Response headers-
I am afraid the CloudFront URL won't be accessible from outside (until we go live), but I am posting the request/response headers, which should give you a fair idea. You can also check the caching headers at the origin: https://origin.ixigo.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
Response after two successive requests from the browser:
Remote Address:54.230.156.66:443
Request URL:https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
Request Method:GET
Status Code:200 OK
Response Headers
Accept-Ranges:bytes
Age:23
Cache-Control:public, max-age=31557600
Connection:keep-alive
Content-Length:8708
Content-Type:image/jpg
Date:Fri, 27 Nov 2015 09:16:03 GMT
ETag:"-170562206"
Last-Modified:Sun, 29 Jun 2014 03:44:59 GMT
Vary:Accept-Encoding
Via:1.1 7968275877e438c758292828c0593684.cloudfront.net (CloudFront)
X-Amz-Cf-Id:fcbGLv8uBOP89qfR52OWa-NlqWkEREJPpZpy9ix0jdq8-a4oTx7lNw==
X-Backend:image6_40
X-Cache:Hit from cloudfront
X-Cache-Hits:0
X-Device:pc
X-DeviceType:pc
X-Powered-By:xyz
Now the same URL requested using curl gives me a MISS:
manu-mdc:cache manuc$ curl -I https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
HTTP/1.1 200 OK
Content-Type: image/jpg
Content-Length: 8708
Connection: keep-alive
Age: 0
Cache-Control: public, max-age=31557600
Date: Fri, 27 Nov 2015 09:16:47 GMT
ETag: "-170562206"
Last-Modified: Sun, 29 Jun 2014 03:44:59 GMT
X-Backend: image6_40
X-Cache-Hits: 0
X-Device: pc
X-DeviceType: pc
X-Powered-By: xyz
Vary: Accept-Encoding
X-Cache: Miss from cloudfront
Via: 1.1 4d42171c56a4c8b5c627040e6aa0938d.cloudfront.net (CloudFront)
X-Amz-Cf-Id: fY0LXhp7NlqB-I8F5-1TIMnA6bONjPD3CEp7dsyVdykP-7N2mbffvw==
A second curl request now gives a HIT:
manu-mdc:cache manuc$ curl -I https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
HTTP/1.1 200 OK
Content-Type: image/jpg
Content-Length: 8708
Connection: keep-alive
Cache-Control: public, max-age=31557600
Date: Fri, 27 Nov 2015 09:16:47 GMT
ETag: "-170562206"
Last-Modified: Sun, 29 Jun 2014 03:44:59 GMT
X-Backend: image6_40
X-Cache-Hits: 0
X-Device: pc
X-DeviceType: pc
X-Powered-By: xyz
Age: 3
Vary: Accept-Encoding
X-Cache: Hit from cloudfront
Via: 1.1 6877899d48ba844a34ea4378ce336f06.cloudfront.net (CloudFront)
X-Amz-Cf-Id: qpPhbLX_5t2Xj0XZuZdjWD2w-BI80DUVyL496meQkLfSEn3ikt7hNg==
This is similar to this issue: Why are two requests with different clients from the same computer cache misses on cloudfront?
CloudFront edge servers cache the object separately depending on whether or not you provide the "Accept-Encoding: gzip" header. Since browsers send this header by default, and your site is likely to be accessed mostly via browsers, I suggest changing your curl call to include this header.
I was facing the same problem; after making that change to my curl call, I started to get a HIT from the browser on my first try (after the curl warm-up).
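For instance, a warm-up request that sends the same Accept-Encoding header a browser would might look like this (a Python 2 sketch using the placeholder URL from above):
import urllib2

url = ('https://youcannotaccess.com/image/upload/'
       't_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg')
# Send the header browsers send by default, so the cached variant
# matches later browser requests.
req = urllib2.Request(url, headers={'Accept-Encoding': 'gzip'})
resp = urllib2.urlopen(req)
resp.read()  # read the whole body; partial downloads are not cached (see below)
print resp.info().getheader('X-Cache')  # e.g. 'Hit from cloudfront' once warm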
Another thing I noticed is that CloudFront requires the full object to be downloaded before it will cache it. If you download the file only partially by specifying a byte range in curl, the object does not get cached; only the downloaded part is cached as a different object. The same goes for a curl that is terminated midway. The other option I tried was a wget call with the --spider option, but that internally does only a HEAD call and thus does not get the content cached on the edge server.

How to find out where a cookie is set?

I am trying to find out where a cookie is being set.
I am running Varnish cache and want to know where the cookie is being set so I know if I can safely remove it for caching purposes.
The response headers look like this:
HTTP/1.1 200 OK
Server: Apache/2.2.17 (Ubuntu)
Expires: Mon, 05 Dec 2011 15:11:39 GMT
Cache-Control: no-store, max-age=7200
Vary: Accept-Encoding
Content-Type: text/html; charset=UTF-8
X-Session: NO
X-Cacheable: YES
Date: Tue, 04 Dec 2012 15:29:40 GMT
X-Varnish: 1233768756 1233766580
Age: 1081
Via: 1.1 varnish
Connection: keep-alive
X-Cache: HIT
There is no cookie present. When loading the same page in a browser the headers are the same: I get a cache hit and no cookie in the response headers.
But then the cookie is there all of a sudden, so it must be being set somewhere. Even if I remove it, it reappears, and it even appears in Incognito mode in Chrome. But it is not in the response headers.
I have been through all the JavaScript on the site and cannot find anything. Is there any other way of setting a cookie?
Thanks.
If the Set-Cookie header goes through Varnish at some point, you can use varnishlog to find the request URL:
$ varnishlog -b -m 'RxHeader:Set-Cookie.*COOKIENAME'
This will give you a full varnishlog listing for the backend requests, including the TxURL to the backend which tells you what the client asked for when it got Set-Cookie back.
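To double-check from the client side whether the cookie ever arrives over HTTP at all, you can also fetch the page directly and print any Set-Cookie headers (a Python 2 sketch; the URL is a placeholder):
import urllib2

# List every Set-Cookie header present in the raw response.
resp = urllib2.urlopen('http://example.com/page')  # placeholder URL
for value in resp.info().getheaders('Set-Cookie'):
    print value
If nothing is printed here (or shows up in varnishlog), the cookie is almost certainly being set client-side, e.g. by JavaScript.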

Is it valid to respond to an HTTP 1.1 request with an HTTP 1.0 response?

I am setting up video delivery to TV set-top boxes.
I want to use Amazon Cloudfront.
The video files are requested as ordinary HTTP requests that may contain a Range header to request partial resources (to let the user jump to any position within the video).
My problem is that it works on two of three boxes; one is causing problems.
The request looks like this (sample data):
GET /path/file.mp4 HTTP/1.1
User-Agent: My User Agent
Host:myhost.com
Accept:*/*
Range: bytes=100-200
So when I make a request to CloudFront using telnet, I see that the response is HTTP/1.0:
joe@flimmit-joe:~$ telnet d2zf9fl0izzsf6.cloudfront.net 80
Trying 216.137.61.164...
Connected to d2zf9fl0izzsf6.cloudfront.net.
Escape character is '^]'.
GET /skin/frontend/default/flimmit/images/headerbanners/02_green.png HTTP/1.1
User-Agent: My User Agent
Host:d2zf9fl0izzsf6.cloudfront.net
Accept:*/*
Range: bytes=100-200
HTTP/1.0 206 Partial Content
Date: Sun, 12 Feb 2012 18:42:15 GMT
Server: Apache/2.2.16 (Ubuntu)
Last-Modified: Tue, 26 Jul 2011 10:37:54 GMT
ETag: "1e0b8a-2d2b-4a8f6863ac11a"
Accept-Ranges: bytes
Cache-Control: max-age=2592000
Expires: Tue, 13 Mar 2012 18:42:15 GMT
Content-Type: image/png
Age: 351213
Content-Range: bytes 100-200/11563
Content-Length: 101
X-Cache: Hit from cloudfront
X-Amz-Cf-Id: W2fzPeBSWb8_Ha_UzvIepZH-Z9xibXyRddoHslJZ3TDXyFfjwE3UMQ==,CwiKc8-JGfE77KBVTTOyE9g-OYf7P-bCJZEWGwef9Es5rzhUBYKE8A==
Via: 1.0 972e3ba2f91fd0a38ea062d0cc03be37.cloudfront.net (CloudFront)
Connection: close
[binary PNG data]
Connection closed by foreign host.
joe@flimmit-joe:~$
Unfortunately I have only limited access to the box for testing purposes.
However, this behavior by CloudFront seems strange to me, so I wanted to ask whether it is even valid.
It is absolutely valid to answer an HTTP/1.1 request with an HTTP/1.0 response.
I'll cite section 19.6 of RFC 2616: "It is beyond the scope of a protocol specification to mandate compliance with previous versions. HTTP/1.1 was deliberately designed, however, to make supporting previous versions easy."
http://www.w3.org/Protocols/rfc2616/rfc2616-sec19.html#sec19.6
The important part is that the RFC does not require an HTTP/1.1 answer, so it is up to the server.
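If you want to verify this from a script rather than telnet, httplib exposes the protocol version of the response (a Python 2 sketch reusing the host and path from the telnet session above):
import httplib

conn = httplib.HTTPConnection('d2zf9fl0izzsf6.cloudfront.net')
conn.request('GET',
             '/skin/frontend/default/flimmit/images/headerbanners/02_green.png',
             headers={'Range': 'bytes=100-200'})
resp = conn.getresponse()
# resp.version is 10 for HTTP/1.0 and 11 for HTTP/1.1
print resp.version, resp.status, resp.reason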