Mechanize - Python - python-2.7

I am using mechanize in Python to log into an HTTPS page. The login is successful, but the output is just a SAML response. I am unable to get the actual page source that I get when opening the page in my browser.
import mechanize
import getpass
import cookielib

br = mechanize.Browser()
br.set_handle_robots(False)
b = []
cj = cookielib.CookieJar()
br.set_cookiejar(cj)
pw = getpass.getpass("Enter Your Password Here: ")
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'),
                 ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
                 ('Accept-Encoding', 'gzip,deflate,sdch'),
                 ('Accept-Language', 'en-US,en;q=0.8'),
                 ('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3')]

# Log in through the single sign-on page
br.open("https:***single sign on login url***")
br.select_form(name='login-form')
br.form['userid'] = 'id'
br.form['password'] = pw
response = br.submit()
print response.read()

# Fetch the protected page and read its first 1000 lines
a = br.open("https:****url****")
for i in range(1000):
    b.append(a.readline())
print b
I get SAML output, which is encrypted, but I don't know how to reply with that SAML POST to get to the actual page.
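In a typical SSO flow the identity provider returns an HTML page containing a hidden form (usually with a SAMLResponse field) that the browser auto-submits to the service provider via JavaScript; mechanize does not run that JavaScript, so the form has to be submitted explicitly. A minimal sketch, assuming the response page contains such an auto-submit form (the form index and field names may differ for your IdP):

# Hedged sketch, continuing from response = br.submit() above: the IdP page
# normally holds an auto-submit form whose hidden SAMLResponse field must be
# POSTed onward, because mechanize does not execute the page's JavaScript.
br.select_form(nr=0)          # assumption: the SAML relay form is the only form on the page
final = br.submit()           # POSTs SAMLResponse (and RelayState, if present) to the service provider
print final.read()            # this should now be the real page behind the SSO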

Related

HTTP response codes coming back wrong where the actual code is 200

I am trying to extract the HTTP links from an XML file and then get the HTTP response code for each of them. But interestingly, I am getting either 500 or 404. If I click on the URL, I get the image properly in the browser.
My code is:
import re
import requests

def extract_src_link(path):
    with open(path, 'r') as myfile:
        for line in myfile:
            if "src" in line:
                src_link = re.search('src=(.+?)ptype="2"', line)
                url = src_link.group(1)
                url = url[1:-1]
                # print("url:", url)
                resp = requests.head(url)
                print(resp.status_code)
Not sure what's happening here. This is what my output looks like:
/usr/local/bin/python2.7
/Users/rradhakrishnan/Projects/eVision/Scripts/xml_validator_ver3.py
Processing:
/Users/rradhakrishnan/rradhakrishnan1/mobily/E30000001554522119_2020_01_27T17_35_40Z.xml
500
404
Processing:
/Users/rradhakrishnan/rradhakrishnan1/mobily/E30000001557496079_2020_01_27T17_35_40Z.xml
500
404
I somehow managed to crack it: adding a User-Agent header resolved the issue.
def extract_src_link(path):
    with open(path, 'r') as myfile:
        for line in myfile:
            if "src" in line:
                src_link = re.search('src=(.+?)ptype="2"', line)
                url = src_link.group(1)
                url = url[1:-1]
                print("url:", url)
                # resp = requests.head(url)
                # print(resp.status_code)
                headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'}
                r = requests.get('http://www.booking.com/reviewlist.html?cc1=tr;pagename=sapphire', headers=headers)
                print(r.status_code)
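Presumably the same fix applies to the original request: pass the headers to the HEAD call on the extracted URL instead of a hard-coded one. A sketch of that variant:

# Sketch: send the User-Agent header with the HEAD request for the extracted URL
resp = requests.head(url, headers=headers)
print(resp.status_code)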

Selenium with chromedriver on CentOS 7 for spidering

I am trying to make a crawler for my server.
I found the Chilkat library's CkSpider, but it does not support JS rendering.
So I am trying to use the Selenium webdriver with Chrome.
I am running CentOS 7 with python2.7.
I want to spider every page under one base domain.
Example:
BaseDomain = example.com
then find all pages, something like
example.com/event/.../../...
example.com/games/.../...
example.com/../.../..
...
My crawler code:
from selenium import webdriver
import time

options = webdriver.ChromeOptions()
options.binary_location = "/usr/bin/google-chrome"
chrome_driver_binary = "/root/chromedriver"
options.add_argument("--headless")
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36")
options.add_argument("lang=ko-KR,ko,en-US,en")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(chrome_driver_binary, chrome_options=options)

host = "http://example.com"

def Crawler(Url):
    driver.get(Url)
    driver.implicitly_wait(3)
    # Do something
    time.sleep(3)
    # Crawl next

Crawler(host)
driver.quit()
How can I crawl the next pages? Is there another way to do this in Selenium, or do I need a different library for it?
Thanks for any tips or advice.
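One common approach (a sketch only, not from the original thread): keep a queue of same-domain links, and for every visited page collect its anchor hrefs with Selenium and enqueue the ones that stay under the base domain. This reuses driver and time from the code above; the function name crawl_domain is hypothetical.

from urlparse import urlparse   # Python 2; on Python 3 use urllib.parse

def crawl_domain(start_url, base_domain, max_pages=500):
    # Breadth-first crawl restricted to one base domain
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        driver.get(url)
        time.sleep(3)                                # give the page's JS a chance to render
        # Do something with driver.page_source here
        for a in driver.find_elements_by_tag_name("a"):
            href = a.get_attribute("href")
            if href and urlparse(href).netloc.endswith(base_domain):
                queue.append(href)
    return seen

crawl_domain(host, "example.com")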

Web scraper fails with HTTP Error 503: Service Unavailable

I am trying to build a scraper, but I keep getting the 503 blocking error. I can still access the website manually, so my IP address hasn't been blocked. I keep switching user agents and still can't get my code to run all the way through. Sometimes I get up to 15, sometimes I don't get any, but it always fails eventually. I have no doubt that I'm doing something wrong in my code. I did shave it down to fit, though, so please keep that in mind. How do I fix this without using third parties?
import requests
import urllib2
from urllib2 import urlopen
import random
from contextlib import closing
from bs4 import BeautifulSoup
import ssl
import parser
import time
from time import sleep

def Parser(urls):
    randomint = random.randint(0, 2)
    randomtime = random.randint(5, 30)
    url = "https://www.website.com"
    user_agents = [
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
        "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
        "Opera/9.80 (Windows NT 6.1; U; cs) Presto/2.2.15 Version/10.00"
    ]
    index = 0
    opener = urllib2.build_opener()
    req = opener.addheaders = [('User-agent', user_agents[randomint])]

def ReadUPC():
    UPCList = [
        'upc',
        'upc2',
        'upc3',
        'upc4',
        'etc.'
    ]
    extracted_data = []
    for i in UPCList:
        urls = "https://www.website.com" + i
        randomtime = random.randint(5, 30)
        Soup = BeautifulSoup(urlopen(urls), "lxml")
        price = Soup.find("span", {"class": "a-size-base a-color-price s-price a-text-bold"})
        sleep(randomtime)
        randomt = random.randint(5, 15)
        print "ref url:", urls
        sleep(randomt)
        print "Our price:", price
        sleep(randomtime)

if __name__ == "__main__":
    ReadUPC()
    index = index + 1
    sleep(10)
The run eventually fails with this traceback excerpt from urllib2:
554 class HTTPDefaultErrorHandler(BaseHandler):
555     def http_error_default(self, req, fp, code, msg, hdrs):
556         raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
557
558 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 503: Service Unavailable
What website are you scraping? Most websites also use cookies to recognize the user, so enable cookies in your code.
Also open that link in a browser along with Firebug and look at the headers your browser sends to the server when making the request, then try to replicate all of those headers.
PS:
In my view, sending random user-agent strings from the SAME IP won't make any difference unless you are rotating IPs.
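For instance, a minimal sketch of enabling cookies with the urllib2 stack already used in the question (user_agents, randomint and urls are reused from that code):

# Sketch: attach a CookieJar so cookies set by the site persist across requests
import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', user_agents[randomint])]  # reuse the agent chosen above
html = opener.open(urls).read()   # use this opener instead of the bare urlopen(urls)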
Behave like a normal human being using a browser. That website appears designed to analyze your behaviour: it sees that you're a scraper and wants to block you; in the easiest case, a minimal piece of JavaScript that changes link URLs on the fly would be enough to disable "dumb" scrapers.
There are elegant ways to solve this dilemma, for example by instrumenting a browser, but that won't happen without external tools.
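As an illustration of "instrumenting a browser" (a sketch only; Selenium is an external tool, and the URL and span class are placeholders taken from the question):

# Sketch: fetch the JS-rendered page through a real browser, then parse it as before
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.website.com" + 'upc')        # placeholder URL built like the one in the question
soup = BeautifulSoup(driver.page_source, "lxml")     # the site's JavaScript has already run
price = soup.find("span", {"class": "a-size-base a-color-price s-price a-text-bold"})
driver.quit()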

Scraping aspx with Python mechanize - getting search results

I've been trying to scrape Congressional financial disclosure reports using mechanize; the form submits successfully, but I can't locate any of the search results. My script is below:
from mechanize import Browser

br = Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open('http://clerk.house.gov/public_disc/financial-search.aspx')
br.select_form(name='aspnetForm')
br.set_all_readonly(False)
br['filing_year'] = ['2008']
response = br.submit(name='search_btn')
html = response.read()
I'm new to scraping, and would appreciate any corrections/advice on this. Thanks!
This is an alternative solution that drives a real browser with the help of the selenium tool.
from selenium import webdriver
from selenium.webdriver.support.select import Select

# initialize webdriver instance and visit url
url = "http://clerk.house.gov/public_disc/financial-search.aspx"
browser = webdriver.Firefox()
browser.get(url)

# find select tag and select 2008
select = Select(browser.find_element_by_id('ctl00_cphMain_txbFiling_year'))
select.select_by_value('2008')

# find "search" button and click it
button = browser.find_element_by_id('ctl00_cphMain_btnSearch')
button.click()

# display results
table = browser.find_element_by_id('search_results')
for row in table.find_elements_by_tag_name('tr')[1:-1]:
    print [cell.text for cell in row.find_elements_by_tag_name('td')]

# close the browser
browser.close()
Prints:
[u'ABERCROMBIE, HON.NEIL', u'HI01', u'2008', u'FD Amendment']
[u'ABERCROMBIE, HON.NEIL', u'HI01', u'2008', u'FD Original']
[u'ACKERMAN, HON.GARY L.', u'NY05', u'2008', u'FD Amendment']
[u'ACKERMAN, HON.GARY L.', u'NY05', u'2008', u'FD Amendment']
...

django-debug-toolbar logging doesn't work?

I can't figure out how to use this plugin...
def homepage(request):
    print request.META['HTTP_USER_AGENT']
    print 'test'
    return render(request, 'base.html')
After this, some output should appear in the logging tab. In the console I've got:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.107 Safari/534.13
test
but in the django-debug-toolbar logging tab I have "No messages logged."
What am I doing wrong?
You need to use the logging module for this to work.
import logging
logger = logging.getLogger(__name__)
logger.debug('Test')
django-debug-toolbar intercepts this call and adds it to the toolbar. When you do print('test'), it just goes to standard out.
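For example, the view from the question could be rewritten like this (a sketch; render is assumed to come from django.shortcuts, as in the original view):

import logging

from django.shortcuts import render

logger = logging.getLogger(__name__)

def homepage(request):
    logger.debug(request.META['HTTP_USER_AGENT'])  # shows up in the toolbar's logging panel
    logger.debug('test')
    return render(request, 'base.html')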