Mechanize not working as expected - python-2.7

I am trying to use mechanize to log in to a page and download a file, but it doesn't work as expected. This is after failing with both urllib2 and requests.
import mechanize
username = '***'
password = '***'
url = 'https://emea1.login.cp.thomsonreuters.net/auth/UI/Login'
br = mechanize.Browser()
br.set_handle_robots(False)
request = br.open(url)
br.select_form(name="frmSignIn")
br['IDToken1'] = username # Set the form values
br['IDToken2'] = password # Set the form values
resp = br.submit()
print resp.read()
The output is the same as the source of the login page itself, which was also the case with requests and urllib2; the login never seems to take effect.
PS: I tried Selenium, only to find that my corporate server doesn't have Firefox etc. installed.
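When the response is just the login page again, the POST usually went through but authentication didn't (wrong field names, missing hidden fields, or a JavaScript-driven login that mechanize can't execute). A quick way to detect that is to check whether the final URL, or the sign-in form itself, still belongs to the login page. This is only a sketch: the helper name and the heuristic are mine, not part of mechanize.

```python
def looks_like_login_page(final_url, html,
                          login_url='https://emea1.login.cp.thomsonreuters.net/auth/UI/Login',
                          form_name='frmSignIn'):
    """Heuristic check: we are probably still on the login page if the
    final URL is the login URL itself, or the sign-in form is still
    present in the response body."""
    return final_url.startswith(login_url) or ('name="%s"' % form_name) in html

# With the browser from the question (requires a live session):
#   resp = br.submit()
#   if looks_like_login_page(br.geturl(), resp.read()):
#       print 'login did not take effect - check hidden fields / JS'
```

If this reports failure, compare the form fields mechanize sees (`for f in br.forms(): print f`) against what the browser actually sends in its dev tools.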

Related

Persistent session in mechanize (Python) / Navigate to another site after login check

I need to log in to a site at one URL (e.g. 'www.targetsite.com/login') and then navigate to another page to scrape data (e.g. 'www.targetsite.com/data'). This is necessary because the site automatically redirects you to the home page after login, no matter which URL you used to access the site in the first place.
I'm using the mechanize python library (old I know, but it has some functions I'll need later on & is a good learning experience).
The problem I'm facing is that the cookie jar doesn't seem to be working the way I thought it would:
import mechanize
import cookielib
from bs4 import BeautifulSoup
cj = cookielib.LWPCookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
###browser emulation
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
###login
login_url = "https://targetsite.org/login"
br.open(login_url)
br.select_form(action="https://targetsite.org/wp-login.php?wpe-login=true")
br.form['log'] = 'login'
br.form['pwd'] = 'password'
br.submit()
target_url = "https://targetsite.com/data"
br.open(target_url)
soup = BeautifulSoup(br.response().read(), 'html.parser')
body_tag = soup.body
all_paragraphs = soup.find_all('p')
print(body_tag.text)
Weirdly, the site doesn't seem to register my logged-in state and redirects my mechanize browser back to the login screen. Any idea what's going on?
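One thing worth ruling out first: the code above logs in on targetsite.org but then requests data from targetsite.com. A cookie jar will never send cookies set for a .org domain to a .com host, so the data request would arrive without the session cookie and trigger exactly this redirect back to the login screen. A minimal sketch of the domain matching (the cookie name and value are made up; the try/except import just lets the same snippet run on Python 2, as in the question, or on Python 3):

```python
try:                                    # Python 2, as in the question
    import cookielib
    from urllib2 import Request
except ImportError:                     # Python 3 fallback
    import http.cookiejar as cookielib
    from urllib.request import Request

cj = cookielib.LWPCookieJar()
# Simulate a session cookie that the login response would set for targetsite.org
cj.set_cookie(cookielib.Cookie(
    version=0, name='session', value='abc123', port=None, port_specified=False,
    domain='.targetsite.org', domain_specified=True, domain_initial_dot=True,
    path='/', path_specified=True, secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={}))

same_domain = Request('http://www.targetsite.org/data')
other_domain = Request('http://www.targetsite.com/data')
cj.add_cookie_header(same_domain)    # cookie attached: domains match
cj.add_cookie_header(other_domain)   # nothing attached: different domain
```

If the login and data URLs really do live on different domains, no cookie jar configuration will fix it; the data request has to go to the same domain that set the session cookie.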

Mechanize: redirect to a url after login which is different from the landing page after login

I'm trying to log in to a page of a website using mechanize (Python). When I log in, I am redirected to the landing page. But I wish to open another URL, which can only be opened after login. Can anyone please tell me how to do it? I checked the documentation but couldn't find an answer, and a Google search didn't help either.
main.py
import mechanize,cookielib
from BeautifulSoup import BeautifulSoup
from CAPTCHA import CaptchaParser
from PIL import Image
from StringIO import StringIO
br = mechanize.Browser()
...
br.set_handle_redirect(True)
br.set_handle_robots(False)
br.set_handle_referer(True)
br.set_handle_refresh(True)
br.set_cookiejar(cookielib.CookieJar())
...
login = br.open("https://vtop.vit.ac.in/student/stud_login.asp")
html = login.read()
soup = BeautifulSoup(html)
im = soup.find('img',id='imgCaptcha')
imgresponse = br.open_novisit(im['src'])
image = Image.open(StringIO(imgresponse.read()))
parser = CaptchaParser()
captcha = parser.getCaptcha(image)
br.select_form(name="stud_login")
regno = raw_input("Enter registration number: ")
passwd = raw_input("Enter password: ")
br['regno'] = regno
br['passwd'] = passwd
br['vrfcd'] = captcha
br.submit()
if br.geturl() != 'https://vtop.vit.ac.in/student/home.asp':
    print "login error"
The page which is opened is https://vtop.vit.ac.in/student/home.asp. From here I wish to redirect to https://vtop.vit.ac.in/student/coursepage_plan_view.asp?sem=FS.
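There is no separate "redirect" step needed: a mechanize Browser keeps its cookie jar between requests, so once the login above has succeeded you simply `br.open()` the course page with the same browser object. A small sketch that builds the `sem=FS` URL explicitly (the try/except import only makes the snippet run on Python 3 as well; on Python 2, as in the question, the first branch is used):

```python
try:
    from urllib import urlencode        # Python 2, as in the question
except ImportError:
    from urllib.parse import urlencode  # Python 3

base = 'https://vtop.vit.ac.in/student/coursepage_plan_view.asp'
course_url = base + '?' + urlencode({'sem': 'FS'})

# With the logged-in browser from the question (requires a live session):
#   course_page = br.open(course_url)
#   print course_page.read()
```

If the site still bounces you back, the session cookie was never set, which points back at the login (e.g. a wrong CAPTCHA) rather than at the navigation.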

Python mechanize implementation of HTTP Basic Auth

I could get HTTP Basic Authentication to work using requests:
import requests
# note: 'pass' is a reserved word in Python, hence 'password' here
request = requests.post(url, auth=(user, password), data={'a': 'whatever'})
And also using urllib2 and urllib:
import urllib2, urllib
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
auth_handler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(auth_handler)
urllib2.install_opener(opener)
content = urllib2.urlopen(url, urllib.urlencode({'a': 'whatever'}))
The problem is I get an unauthorized error when I try the same thing with mechanize:
import mechanize, urllib
from base64 import b64encode
browser = mechanize.Browser()
b64login = b64encode('%s:%s' % (user, password))
browser.addheaders.append(('Authorization', 'Basic %s' % b64login ))
request = mechanize.Request(url)
response = mechanize.urlopen(request, data=urllib.urlencode({'a': 'whatever'}))
error:
HTTPError: HTTP Error 401: UNAUTHORIZED
The code I tried with mechanize may be authenticating in a different way than the other two snippets. So the question is: how can the same authentication be achieved in mechanize?
I am using Python 2.7.12.
The header should have been added to the request instead of the browser. In fact, the browser variable isn't needed at all:
import mechanize, urllib
from base64 import b64encode
b64login = b64encode('%s:%s' % (user, password))
request = mechanize.Request(url)
request.add_header('Authorization', 'Basic %s' % b64login )
response = mechanize.urlopen(request, data=urllib.urlencode({'a':'whatever'}))
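For reference, the token the answer builds by hand is just the Base64 of 'user:password'. The sketch below shows that construction in an encoding-safe way (the credentials are placeholders; 'pass' itself is a reserved word in Python, which is why 'password' is used). If I remember the mechanize API correctly, `Browser.add_password()` offers a built-in alternative, but treat that part as an assumption to verify against the mechanize docs.

```python
from base64 import b64encode

user, password = 'user', 'pass'   # placeholder credentials
token = b64encode(('%s:%s' % (user, password)).encode('ascii')).decode('ascii')
auth_header = ('Authorization', 'Basic %s' % token)  # ('Authorization', 'Basic dXNlcjpwYXNz')

# Assumed mechanize built-in alternative (requires a live URL):
#   br = mechanize.Browser()
#   br.add_password(url, user, password)
#   br.open(url)
```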

Scraping using Mechanize and BS4

I am trying to scrape articles from the Wall Street Journal. This involves logging in using mechanize and then scraping with BeautifulSoup. I was hoping someone could take a look at my code and explain why it isn't working.
I am using Python 2.7 on a 2012 MacBook Pro running the latest software. I'm new to Python, so explain it to me like I'm 5. Any advice would be deeply appreciated. Thanks in advance.
from bs4 import BeautifulSoup
import cookielib
import mechanize
#Browser
br = mechanize.Browser()
#Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Follows refresh 0 but doesn't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.set_debug_http(True) # Print HTTP headers.
# Want more debugging messages?
#br.set_debug_redirects(True)
#br.set_debug_responses(True)
# The site we will navigate into, handling its session
br.open('https://id.wsj.com/access/pages/wsj/us/login_standalone.html?mg=id-wsj')
# Select the first (index zero) form
br.select_form(nr=0)
# User credentials
br.form['username'] = 'username'
br.form['password'] = 'password'
# Login
br.submit()
#br.open("http://online.wsj.com/home-page")
br.open("http://online.wsj.com/news/articles/SB10001424052702304626304579506924089231470?mod=WSJ_hp_LEFTTopStories&mg=reno64-wsj&url=http%3A%2F%2Fonline.wsj.com%2Farticle%2FSB10001424052702304626304579506924089231470.html%3Fmod%3DWSJ_hp_LEFTTopStories&cb=logged0.9458705162058179")
soup = BeautifulSoup(br.response().read(), 'html.parser')
title = soup.find('h1')
print title
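Independent of the login problem, bs4 behaves more predictably when you pass an explicit parser (without one it picks whatever is installed and prints a warning). A tiny self-contained sketch, with made-up article markup standing in for the WSJ page:

```python
from bs4 import BeautifulSoup  # assumes bs4 is installed, as in the question

# Hypothetical markup standing in for the article page
html = '<html><body><h1 class="title">Some Headline</h1><p>Body text</p></body></html>'

soup = BeautifulSoup(html, 'html.parser')  # explicit parser, no guessing
title = soup.find('h1')
print(title.get_text())
```

If `find('h1')` returns None on the real page, the login failed and you got the paywall page instead of the article; check `br.geturl()` after the submit.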

Python, Mechanize | Having issues with being logged out

I'm learning Python and running my first scraper using mechanize.
Goal: log in to a website, navigate to a list of URLs, return a chunk of text from each, then write the results to a CSV.
The code works, but as I run down the list, after 10-15 rows the entire HTML page comes back instead of the chunk of text. Once that happens it won't work correctly until I rerun the operation, and then it hits the same snag after another 10-15 rows.
After looking at the HTML, it appears I'm being logged out, and I can't figure out why. The URLs are all legitimate; if I test getNum individually on any of the links, it works fine.
Last thing: the site needs JavaScript and cookies to log in.
Here is what the functions look like:
def getNum(link):  # takes a URL from the CSV and finds the chunk of text I'm looking for
    import cookielib
    import mechanize
    # Browser
    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follows refresh 0 but doesn't hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    # The site we will navigate into, handling its session
    br.open('https://www.website.com/login')
    # select the first form
    br.select_form(nr=0)
    # user credentials
    br['session_key'] = login
    br['session_password'] = pw
    # Login
    br.submit()
    print 'logged in'
    # open link
    br.open(link)
    html = br.response().read()
    position1 = html.find('text')
    position2 = html.find('>', position1)
    targetNumber = html[position1:position2]
    return targetNumber
def get_Info(inputFile, outputFile):  # takes a CSV, runs getNum for every URL, then writes everything to a new CSV
    import csv
    with open(inputFile, "rb") as csvinput:
        with open(outputFile, 'w+') as csvoutput:
            reader = csv.reader(csvinput)
            writer = csv.writer(csvoutput)
            for rowNum, line in enumerate(reader):
                vv = getNum(str(line[1]))
                line.append(vv)
                writer.writerow(line)
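The most likely cause of the "logged out" behaviour is that getNum builds a fresh browser and performs a full login for every single URL; many sites throttle or temporarily lock accounts after a burst of repeated logins, which matches the 10-15 row pattern. A sketch of a restructure (the function names are mine): log in once, then reuse the same browser for every link, with the chunk extraction factored out so it can be tested on its own.

```python
def extract_chunk(html, marker='text'):
    """Return the substring from `marker` up to the next '>',
    mirroring the find/slice logic in getNum; None if the marker is absent."""
    start = html.find(marker)
    if start == -1:
        return None
    return html[start:html.find('>', start)]

def scrape_all(br, links):
    """Reuse one already-logged-in mechanize browser for every link
    instead of logging in per URL (requires a live session)."""
    return [extract_chunk(br.open(link).read()) for link in links]
```

With this shape, get_Info would call scrape_all once with the browser returned by a single login, instead of calling getNum (and logging in) per row.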