Python, Mechanize | Having issues with being logged out - python-2.7

I'm learning python and running my first scraper using mechanize.
Goal: log in to a website, navigate a list of URLs, extract a chunk of text from each, and write the results to a CSV.
The code works great at first, but as I run down the list, after 10-15 rows the entire HTML page comes back instead of the chunk of text. Once the error happens it won't work correctly until I rerun the operation, and then it hits the same snag after another 10-15 rows.
After looking at the HTML, it appears I'm being logged out, but I can't figure out why. The URLs are all legit; if I test get_Num individually on each of the links, it works fine.
Last thing: the site needs JavaScript and cookies to log in.
Here is what the functions look like:
def get_Num(link): # takes a URL (provided by the CSV) and finds the chunk of text I'm looking for
    import urllib2
    import cookielib
    import urllib
    import requests
    import mechanize
    # Browser
    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_refresh(False)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follows refresh 0 but doesn't hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    # The site we will navigate into, handling its session
    br.open('https://www.website.com/login')
    # select the first form
    br.select_form(nr=0)
    # user credentials
    br['session_key'] = login
    br['session_password'] = pw
    # Log in
    br.submit()
    print 'logged in'
    # open link
    br.open(link)
    html = br.response().read()
    position1 = html.find('text')
    position2 = html.find('>', position1)
    targetNumber = html[position1:position2]
    return targetNumber
def get_Info(inputFile, outputFile): # takes a CSV, runs get_Num for every URL, then writes the whole thing to a CSV
    import urllib2
    import re
    import csv
    with open(inputFile, "rb") as csvinput:
        with open(outputFile, 'w+') as csvoutput:
            reader = csv.reader(csvinput)
            writer = csv.writer(csvoutput)
            for rowNum, line in enumerate(reader):
                vv = get_Num(str(line[1]))
                line.append(vv)
                writer.writerow(line)
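A likely culprit, for what it's worth: get_Num builds a brand-new browser and performs a full login for every single URL, so the site sees a burst of rapid logins and may invalidate the earlier sessions. A common refactor is to log in once and reuse the same browser for every lookup. A minimal sketch of that shape (extract_num and get_num are hypothetical helpers; the marker logic mirrors the original get_Num):

```python
def extract_num(html, marker='text'):
    # Same extraction logic as get_Num: grab everything from the first
    # occurrence of the marker up to (but not including) the next '>'.
    position1 = html.find(marker)
    position2 = html.find('>', position1)
    return html[position1:position2]

def get_num(br, link):
    # Reuse a browser that is already logged in, instead of logging in
    # from scratch for every URL.
    br.open(link)
    return extract_num(br.response().read())
```

get_Info would then build the browser and log in once, and pass the same br into get_num inside the loop.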

Related

Persistent session in mechanize (Python) / Navigate to another site after login check

I need to log in to a site at one URL (ex: 'www.targetsite.com/login') and then navigate to another page to scrape data (ex: 'www.targetsite.com/data'). This is because the site auto-redirects you to the home page after you log in, no matter which URL you used to access the site to begin with.
I'm using the mechanize Python library (old, I know, but it has some functions I'll need later on, and it's a good learning experience).
The problem I'm facing is that the cookiejar doesn't seem to be working the way I thought it would:
import mechanize
import Cookie
import cookielib
from bs4 import BeautifulSoup
cj = cookielib.LWPCookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
###browser emulation
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
###login
login_url = "https://targetsite.org/login"
br.open(login_url)
br.select_form(action="https://targetsite.org/wp-login.php?wpe-login=true")
br.form['log'] = 'login'
br.form['pwd'] = 'password'
br.submit()
target_url = "https://targetsite.com/data"
br.open(target_url)
soup = BeautifulSoup(br.response().read())
body_tag = soup.body
all_paragraphs = soup.find_all('p')
print(body_tag.text)
Weirdly, the site doesn't seem to register my logged-in state and redirects my mechanize br back to the login screen. Any idea what's going on?
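One way to see what is going on, sketched under the assumption that the form submission itself goes through: inspect the cookiejar right after br.submit(). If the jar stays empty, the session cookie is being set by JavaScript, which mechanize does not execute, and that would explain being bounced back to the login screen:

```python
try:
    import cookielib                      # Python 2 name, as in the question
except ImportError:
    import http.cookiejar as cookielib    # same module under its Python 3 name

cj = cookielib.LWPCookieJar()

# ... build the browser with br.set_cookiejar(cj), submit the login form,
# then dump what the server actually stored in the jar:
for cookie in cj:
    print(cookie.name, cookie.domain, cookie.expires)
```

An empty printout after a "successful" submit is a strong hint the login depends on JavaScript, not on the form POST alone.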

Mechanize: redirect to a url after login which is different from the landing page after login

I'm trying to log in to a page of a website using mechanize (Python). When I log in, I'm redirected to the landing page, but I wish to open another URL which can only be opened after login. Can anyone please tell me how to do that? I checked the documentation but couldn't find an answer, and a Google search didn't help either. Here is main.py:
import mechanize,cookielib
from BeautifulSoup import BeautifulSoup
from CAPTCHA import CaptchaParser
from PIL import Image
from StringIO import StringIO
br = mechanize.Browser()
...
br.set_handle_redirect(True)
br.set_handle_robots(False)
br.set_handle_referer(True)
br.set_handle_refresh(True)
br.set_cookiejar(cookielib.CookieJar())
br.set_handle_redirect(mechanize.HTTPRedirectHandler)
...
login = br.open("https://vtop.vit.ac.in/student/stud_login.asp")
html = login.read()
soup = BeautifulSoup(html)
im = soup.find('img',id='imgCaptcha')
imgresponse = br.open_novisit(im['src'])
image = Image.open(StringIO(imgresponse.read()))
parser = CaptchaParser()
captcha = parser.getCaptcha(image)
br.select_form(name="stud_login")
regno=raw_input("Enter registration number: ")
passwd=raw_input("Enter password: ")
br['regno']=regno
br['passwd']=passwd
br['vrfcd']=captcha
br.submit()
if br.geturl()!='https://vtop.vit.ac.in/student/home.asp':
print "login error"
The page which is opened is https://vtop.vit.ac.in/student/home.asp. From here I wish to redirect to https://vtop.vit.ac.in/student/coursepage_plan_view.asp?sem=FS.
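The usual answer here is that no special redirect handling is needed: the Browser object keeps the session cookies from the login, so a plain br.open() on the course page is already an authenticated request. A tiny sketch of that idea (open_after_login is a hypothetical helper, not mechanize API):

```python
def open_after_login(br, url):
    # The same Browser object carries the session cookie set during login,
    # so any later open() on it rides on the authenticated session.
    resp = br.open(url)
    return resp.read()
```

After the br.submit() above, open_after_login(br, 'https://vtop.vit.ac.in/student/coursepage_plan_view.asp?sem=FS') should land on the course page rather than the login screen, assuming the CAPTCHA and credentials were accepted.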

How can I go to a specific page of a website, fetch the desired data using Python, and save it into an Excel sheet? This code needs the URL of the desired page.

import requests
from bs4 import BeautifulSoup
import xlrd
file = r"C:\Users\Ashadeep\PycharmProjects\untitled1\xlwt.ashadee.xls"
workbook = xlrd.open_workbook(file)
sheet = workbook.sheet_by_index(0)
print(sheet.cell_value(0, 0))
r = requests.get(sheet.cell_value(0, 0))
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.find_all("div", {"class": "admissionhelp-left"})
print(g_data)
text = soup.find_all("Tel")
for item in g_data:
    print(item.text)
Are you trying to download an Excel file from the web and save it to your HDD? I don't see any URL in your code, but you can try one of these three approaches.
import urllib
dls = "http://www.muellerindustries.com/uploads/pdf/UW SPD0114.xls"
urllib.urlretrieve(dls, "test.xls")
import requests
dls = "http://www.muellerindustries.com/uploads/pdf/UW SPD0114.xls"
resp = requests.get(dls)
with open('test.xls', 'wb') as output:
output.write(resp.content)
Or, if you don't necessarily need to go through the browser, you can use the urllib module to save a file to a specified location.
import urllib
url = 'http://www.example.com/file/processing/path/excelfile.xls'
local_fname = '/home/John/excelfile.xls'
filename, headers = urllib.urlretrieve(url, local_fname)

Mechanize not working as expected

I am trying to use Mechanize to login to a page and download a file but it doesn't seem to work as expected. This is after I have failed with urllib2 and requests.
import mechanize
username = '***'
password = '***'
url = 'https://emea1.login.cp.thomsonreuters.net/auth/UI/Login'
br = mechanize.Browser()
br.set_handle_robots(False)
request = br.open(url)
br.select_form(name="frmSignIn")
br['IDToken1'] = username # Set the form values
br['IDToken2'] = password # Set the form values
resp = br.submit()
print resp.read()
The output is the same as the source code of url; it was the same case with requests and urllib.
PS: I tried using selenium, only to find that my corporate server doesn't have Firefox etc. installed.
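A debugging step that often narrows this down, assuming the form submission itself goes through: compare the URL mechanize lands on after submit with the login URL. diagnose_login below is a hypothetical helper, not mechanize API; it only relies on br.geturl(), which returns the final URL after any redirects:

```python
def diagnose_login(br, login_url):
    # If we are still on the login URL after submit, the server rejected
    # the credentials or a hidden form field (token, CAPTCHA) was missing.
    landed = br.geturl()
    if landed == login_url:
        return 'still on login page'
    return 'landed on ' + landed
```

Printing diagnose_login(br, url) right after resp = br.submit() tells you whether the server ever moved you off the sign-in form.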

Scraping using Mechanize and BS4

I am trying to scrape articles from the Wall Street Journal. This involves logging in using mechanize and scraping using BeautifulSoup. I was hoping someone could take a look at my code and explain to me why it's not working.
I am using python 2.7 on a 2012 MacBook Pro running the latest software. I'm new to python so explain to me like I'm 5. Any advice would be deeply appreciated. Thanks in advance.
from bs4 import BeautifulSoup
import cookielib
import mechanize
#Browser
br = mechanize.Browser()
#Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.set_debug_http(True) # Print HTTP headers.
# Want more debugging messages?
#br.set_debug_redirects(True)
#br.set_debug_responses(True)
# The site we will navigate into, handling its session
br.open('https://id.wsj.com/access/pages/wsj/us/login_standalone.html?mg=id-wsj')
# Select the first (index zero) form
br.select_form(nr=0)
# User credentials
br.form['username'] = 'username'
br.form['password'] = 'password'
# Login
br.submit()
#br.open("http://online.wsj.com/home-page")
br.open("http://online.wsj.com/news/articles/SB10001424052702304626304579506924089231470?mod=WSJ_hp_LEFTTopStories&mg=reno64-wsj&url=http%3A%2F%2Fonline.wsj.com%2Farticle%2FSB10001424052702304626304579506924089231470.html%3Fmod%3DWSJ_hp_LEFTTopStories&cb=logged0.9458705162058179")
soup = BeautifulSoup(br.response().read())
title = soup.find('h1')
print title
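One sanity check worth adding, since soup.find('h1') returns None when the page is a login bounce or paywall stub: verify an h1 actually exists before trusting the scrape. The helper below is a hypothetical, BeautifulSoup-free version of the same check using plain string searching:

```python
def first_h1(html):
    # Return the text of the first <h1>, or None if the page has none
    # (e.g. we were bounced back to the login page).
    start = html.find('<h1')
    if start == -1:
        return None
    start = html.find('>', start) + 1
    end = html.find('</h1>', start)
    return html[start:end]
```

If first_h1 (or soup.find('h1')) comes back None on the article URL, the login did not stick and the article body was never served.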