web scraper with HTTP Error 503: Service Unavailable - python-2.7

I am trying to build a scraper, but I keep getting the 503 blocking error. I can still access the website manually, so my IP address hasn't been blocked. I keep switching user agents and still can't get my code to run all the way through. Sometimes I get through up to 15 requests, sometimes none at all, but it always fails eventually. I have no doubt that I'm doing something wrong in my code. I did shave it down to fit here, though, so please keep that in mind. How do I fix this without using third-party services?
import requests
import urllib2
from urllib2 import urlopen
import random
from contextlib import closing
from bs4 import BeautifulSoup
import ssl
import parser
import time
from time import sleep

def Parser(urls):
    randomint = random.randint(0, 2)
    randomtime = random.randint(5, 30)
    url = "https://www.website.com"
    user_agents = [
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
        "Opera/9.80 (Windows NT 6.1; U; cs) Presto/2.2.15 Version/10.00"
    ]
    index = 0
    opener = urllib2.build_opener()
    req = opener.addheaders = [('User-agent', user_agents[randomint])]

def ReadUPC():
    UPCList = [
        'upc',
        'upc2',
        'upc3',
        'upc4',
        'etc.'
    ]
    extracted_data = []
    for i in UPCList:
        urls = "https://www.website.com" + i
        randomtime = random.randint(5, 30)
        Soup = BeautifulSoup(urlopen(urls), "lxml")
        price = Soup.find("span", {"class": "a-size-base a-color-price s-price a-text-bold"})
        sleep(randomtime)
        randomt = random.randint(5, 15)
        print "ref url:", urls
        sleep(randomt)
        print "Our price:", price
        sleep(randomtime)

if __name__ == "__main__":
    ReadUPC()
    index = index + 1
    sleep(10)
554 class HTTPDefaultErrorHandler(BaseHandler):
555 def http_error_default(self, req, fp, code, msg, hdrs):
556 raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
557
558 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 503: Service Unavailable

What website are you scraping? Most websites use cookies to recognize the user as well, so please enable cookies in your code.
Also, open that link in a browser with Firebug and look at the headers your browser sends to the server when making the request, then try to fake all of those headers.
PS:
In my view, sending random user-agent strings from the SAME IP won't make any difference unless you are rotating IPs.
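As an illustration of that advice, here is a minimal sketch (mine, not from the answer) that stays within the standard library: a cookielib.CookieJar attached to a urllib2 opener plus a fuller set of browser-like headers. The header values and the URL are assumptions for illustration, not something captured from the actual target site.
import urllib2
import cookielib

# Stdlib-only: cookie handling plus a fuller set of browser-like headers.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Connection', 'keep-alive'),
]

# Hypothetical URL in the spirit of the question's placeholder.
html = opener.open("https://www.website.com/upc").read()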

Behave like a normal human being using a browser. That website appears to be designed to analyze your behaviour; it has decided you are a scraper and wants to block you. In the easiest case, a minimal piece of JavaScript that rewrites link URLs on the fly is enough to defeat "dumb" scrapers.
There are elegant ways to solve this dilemma, for example by instrumenting a browser, but that won't happen without external tools.

Related

HTTP response codes coming wrongly where it is actually 200

I am trying to extract the HTTP links from an XML file and then get the HTTP response code for each of them. Interestingly, I am getting either 500 or 404, yet if I click on the URL, the image loads properly in the browser.
My Code is:
import re
import requests

def extract_src_link(path):
    with open(path, 'r') as myfile:
        for line in myfile:
            if "src" in line:
                src_link = re.search('src=(.+?)ptype="2"', line)
                url = src_link.group(1)
                url = url[1:-1]
                # print("url:", url)
                resp = requests.head(url)
                print(resp.status_code)
Not sure what's happening here. This is what my output looks like:
/usr/local/bin/python2.7
/Users/rradhakrishnan/Projects/eVision/Scripts/xml_validator_ver3.py
Processing:
/Users/rradhakrishnan/rradhakrishnan1/mobily/E30000001554522119_2020_01_27T17_35_40Z.xml
500
404
Processing:
/Users/rradhakrishnan/rradhakrishnan1/mobily/E30000001557496079_2020_01_27T17_35_40Z.xml
500
404
I somehow managed to crack it: adding a User-Agent header resolved the issue.
import re
import requests

def extract_src_link(path):
    with open(path, 'r') as myfile:
        for line in myfile:
            if "src" in line:
                src_link = re.search('src=(.+?)ptype="2"', line)
                url = src_link.group(1)
                url = url[1:-1]
                print ("url:", url)
                # resp = requests.head(url)
                # print(resp.status_code)
                headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'}
                r = requests.get('http://www.booking.com/reviewlist.html?cc1=tr;pagename=sapphire', headers=headers)
                print r.status_code
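A small follow-up sketch (mine, not from the answer): the same User-Agent header can be passed to the original requests.head call on the extracted url instead of the hard-coded test URL.
# Sketch: apply the same User-Agent to the HEAD request on the extracted url.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'}
resp = requests.head(url, headers=headers, allow_redirects=True)
print(resp.status_code)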

Django Rest Framework returns bad request when POSTED a file by filepond on React

I have a react app, that uses filepond. Filepond accepts a file, and POSTs it to the server using the following custom header:
const filepondServer = {
  url: `${apiRoot}expenses/receipt_upload`,
  process: {
    headers: {
      Authorization: `Token ${this.props.auth.token}`
    }
  }
};
This goes to a django rest framework view:
class ExpenseReceiptUploadView(APIView):
    permission_classes = [permissions.IsAuthenticated, HasMetis]
    parser_classes = (FileUploadParser,)

    def post(self, request):
        receipt = request.data["file"]
        return Response(status=status.HTTP_201_CREATED)
(I know it needs fleshing out for error handling etc., but that will come once it works.)
This returns a 400 error with no further details. If I remove the receipt = request.data["file"] line, it returns a 201 and the server doesn't complain.
To debug this, I tried printing request, which works fine, but accessing request.data results in a 400, as does request.FILES.
The error is very terse; it just says:
2018-12-21 00:01:35,850 [middlewares 70] INFO: {"method": "POST", "path": "/api/v1/operations/expenses/receipt_upload", "user": "Alex", "user_id": 27192835, "device_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36", "request_post_body": {"filepond": "{}"}}
2018-12-21 00:01:35,851 [log 228] WARNING: Bad Request: /api/v1/operations/expenses/receipt_upload
[21/Dec/2018 00:01:35] "POST /api/v1/operations/expenses/receipt_upload HTTP/1.1" 400 0
Author of FilePond here.
FilePond will also post the file's metadata object using the same field name. This works fine on PHP, but I'm not sure whether it is troublesome on other backends; I think that's what the log line below is indicating.
"request_post_body": {"filepond": "{}"}
In versions up to FilePond 3.7 the metadata was posted first. I've swapped this around in 3.7 so that the file is posted first, so I'm wondering whether you're using 3.7 or an earlier version, and whether that makes any difference.
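For reference, here is a minimal server-side sketch (mine, not from the thread), assuming filepond is posting ordinary multipart/form-data: DRF's FileUploadParser expects the raw file bytes as the request body, so switching to MultiPartParser/FormParser and reading request.FILES is one common way to avoid the 400. The field name "filepond" and the dropped HasMetis permission are assumptions for illustration only.
# Sketch only: assumes filepond posts multipart/form-data under its default
# field name "filepond"; adjust the field name to whatever the client uses.
from rest_framework import permissions, status
from rest_framework.parsers import FormParser, MultiPartParser
from rest_framework.response import Response
from rest_framework.views import APIView

class ExpenseReceiptUploadView(APIView):
    permission_classes = [permissions.IsAuthenticated]
    parser_classes = (MultiPartParser, FormParser)

    def post(self, request):
        receipt = request.FILES.get("filepond")  # hypothetical field name
        if receipt is None:
            return Response({"detail": "no file received"},
                            status=status.HTTP_400_BAD_REQUEST)
        # save the uploaded receipt here
        return Response(status=status.HTTP_201_CREATED)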

selenium with chromedriver on centOS7 for spidering

I am trying to make a crawler for my server.
I found chilkat's CkSpider, but it does not support JS rendering,
so I am trying to use the Selenium webdriver with Chrome.
I am running CentOS 7 and Python 2.7.
I want to spider every page under one base domain.
Example
BaseDomain = example.com
then find all page something like
example.com/event/.../../...
example.com/games/.../...
example.com/../.../..
...
My Crawler code
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.binary_location = "/usr/bin/google-chrome"
chrome_driver_binary = "/root/chromedriver"
options.add_argument("--headless")
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36")
options.add_argument("lang=ko-KR,ko,en-US,en")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_driver_binary, chrome_options=options)

host = "http://example.com"

def Crawler(Url):
    driver.get(Url)
    driver.implicitly_wait(3)
    # Do Something
    time.sleep(3)
    # Crawl next

Crawler(host)
driver.quit()
How can I crawl the next pages? Is there another way to do this in Selenium,
or do I need another library for that?
Thanks for any tips or advice.
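One possible approach, as a minimal sketch (mine; there is no accepted answer in this thread): collect the href of every anchor on the current page, keep only URLs on the base domain, and work through a queue of unvisited URLs instead of recursing. It assumes the driver configured above.
# Sketch: breadth-first crawl restricted to one base domain.
from urlparse import urlparse  # Python 2; use urllib.parse on Python 3

base_domain = "example.com"
to_visit = ["http://example.com"]
visited = set()

while to_visit:
    url = to_visit.pop(0)
    if url in visited:
        continue
    visited.add(url)
    driver.get(url)
    time.sleep(3)  # or an explicit wait for the JS-rendered content
    # Do Something with driver.page_source here
    for a in driver.find_elements_by_tag_name("a"):
        href = a.get_attribute("href")
        if href and urlparse(href).netloc.endswith(base_domain) and href not in visited:
            to_visit.append(href)

driver.quit()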

python SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')",),))

I was scraping this ASPX website: https://gra206.aca.ntu.edu.tw/Temp/W2.aspx?Type=2 .
As it requires, I have to pass __VIEWSTATE and __EVENTVALIDATION when sending a POST request, so I am trying to send a GET request first to obtain those two values and parse them afterward.
However, I have tried sending the GET request several times, and it always ends up throwing this error message:
requests.exceptions.SSLError: HTTPSConnectionPool(host='gra206.aca.ntu.edu.tw', port=443): Max retries exceeded with url: /Temp/W2.aspx?Type=2 (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')",),))
I have tried:
upgrading OpenSSL
installing requests[security]
However, neither of them works.
I am currently using:
env:
python 2.7
bs4 4.6.0
requests 2.18.4
openssl 1.0.2n
Here is my code:
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    s.auth = ('user', 'pass')
    s.headers.update({'x-test': 'true'})
    url = 'https://gra206.aca.ntu.edu.tw/Temp/W2.aspx?Type=2'
    r = s.get(url, headers={'x-test2': 'true'})
    soup = BeautifulSoup(r.content, 'lxml')
    viewstate = soup.find('input', {'id': '__VIEWSTATE'})['value']
    validation = soup.find('input', {'id': '__EVENTVALIDATION'})['value']
    print viewstate, generator, validation
I am also looking for a solution to this. Some sites have deprecated TLSv1.0, and Requests + OpenSSL (on Windows 7) has trouble completing the handshake with such a host. A Wireshark log showed the client issuing a TLSv1 Client Hello that the host did not answer correctly, and this error propagated up as the message Requests showed. Even with the most up-to-date OpenSSL/pyOpenSSL/Requests, tried on Py3.6 and 2.7.12, no luck. Interestingly, when I replace the URL with another one such as "google.com", the log shows a TLSv1.2 Hello being issued and answered by the host. Please check the images tlsv1 and tlsv1.2.
Clearly the client is TLSv1.2-capable, so why does it use a v1.0 Hello in the former case?
[EDIT]
I was wrong in the previous statement: Wireshark misinterpreted an unfinished TLSv1.2 HELLO exchange as TLSv1. After digging into it further, I found that these hosts expect plain TLSv1, not a TLSv1 fallback from TLSv1.2, probably because OpenSSL's Hello lacks some extension fields (maybe Supported Versions) compared with the log from Chrome. I found a workaround: 1. force TLSv1 negotiation, and 2. change the default cipher suite back to the py3.4 style to re-enable 3DES.
import ssl
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager
#from urllib3.poolmanager import PoolManager
from requests.packages.urllib3.util.ssl_ import create_urllib3_context

# py3.4 default cipher string
CIPHERS = (
    'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:'
    'DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:!aNULL:'
    '!eNULL:!MD5'
)

class DESAdapter(HTTPAdapter):
    """
    A TransportAdapter that re-enables 3DES support in Requests
    and forces TLSv1 negotiation.
    """
    def create_ssl_context(self):
        #ctx = create_urllib3_context(ciphers=FORCED_CIPHERS)
        ctx = ssl.create_default_context()
        # the default context already disables SSLv2/SSLv3; turning off
        # TLS 1.1 and TLS 1.2 below leaves only TLS 1.0
        #ctx.options |= ssl.OP_NO_SSLv2
        #ctx.options |= ssl.OP_NO_SSLv3
        #ctx.options |= ssl.OP_NO_TLSv1
        ctx.options |= ssl.OP_NO_TLSv1_2
        ctx.options |= ssl.OP_NO_TLSv1_1
        #ctx.options |= ssl.OP_NO_TLSv1_3
        ctx.set_ciphers(CIPHERS)
        #ctx.set_alpn_protocols(['http/1.1', 'spdy/2'])
        return ctx

    def init_poolmanager(self, *args, **kwargs):
        kwargs['ssl_context'] = self.create_ssl_context()
        return super(DESAdapter, self).init_poolmanager(*args, **kwargs)

    def proxy_manager_for(self, *args, **kwargs):
        kwargs['ssl_context'] = self.create_ssl_context()
        return super(DESAdapter, self).proxy_manager_for(*args, **kwargs)

tmoval = 10
proxies = {}
hdr = {'Accept-Language': 'zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4',
       'Cache-Control': 'max-age=0',
       'Connection': 'keep-alive',
       'Proxy-Connection': 'keep-alive',
       #'Cache-Control': 'no-cache', 'Connection': 'close',
       'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36',
       'Accept-Encoding': 'gzip,deflate,sdch',
       'Accept': '*/*'}

url = 'https://gra206.aca.ntu.edu.tw/Temp/W2.aspx?Type=2'  # URL from the question; not defined in the original snippet
ses = requests.session()
ses.mount(url, DESAdapter())
response = ses.get(url, timeout=tmoval, headers=hdr, proxies=proxies)
[EDIT2]
When your HTTPS URL contains any uppercase letters, the patch fails to work; you need to convert them to lowercase. Something unknown in the requests/urllib3/OpenSSL stack causes the patch logic to fall back to its default TLS1.2 behaviour.
[EDIT3]
from http://docs.python-requests.org/en/master/user/advanced/
The mount call registers a specific instance of a Transport Adapter to a prefix. Once mounted, any HTTP request made using that session whose URL starts with the given prefix will use the given Transport Adapter.
So, to make all HTTPS requests use the new adapter, including those the server later redirects to, this line must be changed to:
ses.mount('https://', DESAdapter())
Somehow this also fixed the uppercase problem mentioned above.
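Putting the pieces together, a short usage sketch (mine, combining the adapter above with the parsing code from the question):
# Usage sketch: mount the adapter for all HTTPS URLs, fetch the page, and
# pull out the hidden ASP.NET fields the question needs for its later POST.
from bs4 import BeautifulSoup

url = 'https://gra206.aca.ntu.edu.tw/Temp/W2.aspx?Type=2'
ses = requests.session()
ses.mount('https://', DESAdapter())
r = ses.get(url, timeout=tmoval, headers=hdr)

soup = BeautifulSoup(r.content, 'lxml')
viewstate = soup.find('input', {'id': '__VIEWSTATE'})['value']
validation = soup.find('input', {'id': '__EVENTVALIDATION'})['value']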

Mechanize - Python

I am using mechanize in Python to log into an HTTPS page. The login is successful, but the output is just a SAML response. I am unable to get the actual page source that I get when opening the page in my browser.
import mechanize
import getpass
import cookielib

br = mechanize.Browser()
br.set_handle_robots(False)
b = []
cj = cookielib.CookieJar()
br.set_cookiejar(cj)
pw = getpass.getpass("Enter Your Password Here: ")
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'),
                 ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
                 ('Accept-Encoding', 'gzip,deflate,sdch'),
                 ('Accept-Language', 'en-US,en;q=0.8'),
                 ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3')]

br.open("https:***single sign on login url***")
br.select_form(name='login-form')
br.form['userid'] = 'id'
br.form['password'] = pw
response = br.submit()
print response.read()

a = br.open("https:****url****")
for i in range(1000):
    b.append(a.readline())
print b
I get SAML output, which is encrypted, but I don't know how to reply with that SAML POST to get to the actual page.
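A minimal sketch of the usual next step (mine; there is no answer in this thread): in the SAML POST binding, the IdP returns an HTML page containing a hidden form with a SAMLResponse field that the browser normally auto-submits to the service provider via JavaScript. mechanize does not run that JavaScript, so the form has to be submitted manually. The form index and field names here are typical defaults and may differ for your IdP.
# Sketch: manually submit the auto-post SAML form that the browser would
# normally submit for you.
response = br.submit()            # submit the login form as before
br.select_form(nr=0)              # the auto-post form is usually the only form on the page
# the hidden SAMLResponse (and RelayState) fields are already filled in by the IdP
sp_page = br.submit()             # POST the assertion to the service provider
print sp_page.read()              # this should now be the actual page source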