I am trying to scrape articles from the Wall Street Journal. This involves logging in with mechanize and scraping with BeautifulSoup. I was hoping someone could take a look at my code and explain why it's not working.
I am using Python 2.7 on a 2012 MacBook Pro running the latest software. I'm new to Python, so explain it to me like I'm 5. Any advice would be deeply appreciated. Thanks in advance.
from bs4 import BeautifulSoup
import cookielib
import mechanize
#Browser
br = mechanize.Browser()
#Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# User-Agent
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Follows refresh 0 but doesn't hang on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.set_debug_http(True) # Print HTTP headers.
# Want more debugging messages?
#br.set_debug_redirects(True)
#br.set_debug_responses(True)
# The site we will navigate into, handling its session
br.open('https://id.wsj.com/access/pages/wsj/us/login_standalone.html?mg=id-wsj')
# Select the first (index zero) form
br.select_form(nr=0)
# User credentials
br.form['username'] = 'username'
br.form['password'] = 'password'
# Login
br.submit()
#br.open("http://online.wsj.com/home-page")
br.open("http://online.wsj.com/news/articles/SB10001424052702304626304579506924089231470?mod=WSJ_hp_LEFTTopStories&mg=reno64-wsj&url=http%3A%2F%2Fonline.wsj.com%2Farticle%2FSB10001424052702304626304579506924089231470.html%3Fmod%3DWSJ_hp_LEFTTopStories&cb=logged0.9458705162058179")
soup = BeautifulSoup(br.response().read(), 'html.parser')  # name a parser explicitly
title = soup.find('h1')
print title
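Independent of the login problem, the title-extraction step can be sanity-checked against static HTML with the standard library alone, no mechanize or network needed. A minimal sketch (Python 3 syntax, while the question itself uses Python 2) that pulls the first `<h1>` out of a page, roughly what `soup.find('h1')` does:

```python
# Extract the text of the first <h1> using only the standard library.
from html.parser import HTMLParser

class FirstH1(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_h1 = False   # currently inside the first <h1>?
        self.done = False    # already captured the first <h1>?
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.text += data

def first_h1(html):
    parser = FirstH1()
    parser.feed(html)
    return parser.text.strip()

sample = "<html><body><h1>Headline</h1><p>Body</p></body></html>"
print(first_h1(sample))  # -> Headline
```

If this works on a saved copy of the article page but the live run prints nothing useful, the problem is the login/session step, not the parsing.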
Related
I need to login to a site at one url (ex: 'www.targetsite.com/login') and then navigate to another site to scrape data (ex: 'www.targetsite.com/data'). This is because the site auto directs you to the home page after you login, no matter which url you used to access the site to begin with.
I'm using the mechanize python library (old I know, but it has some functions I'll need later on & is a good learning experience).
The problem I'm facing is that the cookiejar doesn't seem to be working the way I thought it would
import mechanize
import cookielib
from bs4 import BeautifulSoup  # used below
cj = cookielib.LWPCookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
###browser emulation
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
###login
login_url = "https://targetsite.org/login"
br.open(login_url)
br.select_form(action="https://targetsite.org/wp-login.php?wpe-login=true")
br.form['log'] = 'login'
br.form['pwd'] = 'password'
br.submit()
target_url = "https://targetsite.com/data"
br.open(target_url)
soup = BeautifulSoup(br.response().read())
body_tag = soup.body
all_paragraphs = soup.find_all('p')
print(body_tag.text)
Weirdly, the site doesn't seem to register my logged-in state and redirects my mechanize Browser back to the login screen. Any idea what's going on?
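One thing worth double-checking: the login URL in the code uses targetsite.org while the data URL uses targetsite.com. Cookies are domain-scoped, so if that mismatch is real and not just anonymisation, the session cookie from the login will never be sent with the data request. Separately, if the goal is for the login to survive between runs, the jar has to be saved and reloaded explicitly. A sketch of that round trip (Python 3 names shown; `cookielib` became `http.cookiejar`, and the cookie values here are made up for illustration):

```python
# Persist a cookie jar to disk and reload it, as a logged-in session would be.
import http.cookiejar
import os
import tempfile
import time

path = os.path.join(tempfile.gettempdir(), "session_cookies.txt")

cj = http.cookiejar.LWPCookieJar(path)
# Fabricate a session cookie the way a login response would set one.
cookie = http.cookiejar.Cookie(
    version=0, name="session", value="abc123", port=None, port_specified=False,
    domain="targetsite.org", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True, secure=False,
    expires=int(time.time()) + 3600, discard=False,
    comment=None, comment_url=None, rest={},
)
cj.set_cookie(cookie)
cj.save(ignore_discard=True)       # call after br.submit() succeeds

cj2 = http.cookiejar.LWPCookieJar(path)
cj2.load(ignore_discard=True)      # call before br.open() on the next run
print([c.name for c in cj2])       # -> ['session']
```

With mechanize, the reloaded jar would be handed to the browser via `br.set_cookiejar(cj2)` before opening the data URL. Note the cookie's `domain` field: a jar full of `.org` cookies does nothing for a `.com` request.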
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import os
xpaths = { 'video' : "//video[@id='video']" }
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36")
driver = webdriver.Firefox(profile)
mydriver = webdriver.Firefox()
baseurl = "XXXX"
mydriver.get(baseurl)
It's not changing the user agent. I want the user agent to be Chrome, but I don't know what's wrong.
Also, here's what I'd like it to do: go to the website; if it redirects to another URL, go back to the main page and keep doing that until it finds the element with id 'video'.
I haven't implemented this yet because I have no idea how.
The website I'm trying to automate has a video that only appears sometimes. What I'd like this to do is keep visiting the website until it finds the video element, then click it and wait.
Help is appreciated :)
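For the "keep visiting until the video appears" part, the usual shape is a bounded polling loop. Selenium specifics aside, the control flow can be sketched as a plain function; `visit` and `found` below are hypothetical stand-ins for the real Selenium calls (`driver.get(baseurl)` and `bool(driver.find_elements(By.XPATH, "//video[@id='video']"))` respectively):

```python
import time

def poll_until(visit, found, attempts=10, delay=0.0):
    """Re-visit a page until found() reports success or attempts run out."""
    for _ in range(attempts):
        visit()                 # load (or reload) the page
        if found():             # is the video element present now?
            return True
        time.sleep(delay)       # small pause between reloads
    return False

# Toy demo: the "video" shows up on the third visit.
state = {"visits": 0}
def visit():
    state["visits"] += 1
def found():
    return state["visits"] >= 3

print(poll_until(visit, found))  # -> True
print(state["visits"])           # -> 3
```

In the real script, a `True` return would be followed by clicking the element (e.g. `driver.find_element(By.XPATH, xpaths['video']).click()`); the bounded `attempts` keeps the loop from hammering the site forever if the video never appears.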
You are navigating to your application URL with the wrong Firefox instance (mydriver). Using the correct instance, the one created with the required profile setting (driver in your case), should do the trick.
Below is the correct code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import os
xpaths = { 'video' : "//video[@id='video']" }
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36")
driver = webdriver.Firefox(profile)
# the below line is not required
#mydriver = webdriver.Firefox()
baseurl = "XXXX"
# navigate to url with 'driver' instead of 'mydriver'
driver.get(baseurl)
If you change your baseurl to "http://whatsmyuseragent.com/", you will be able to right away see if the user agent change is reflected correctly.
Hope this helps!
I am trying to use Mechanize to login to a page and download a file but it doesn't seem to work as expected. This is after I have failed with urllib2 and requests.
import mechanize
username = '***'
password = '***'
url = 'https://emea1.login.cp.thomsonreuters.net/auth/UI/Login'
br = mechanize.Browser()
br.set_handle_robots(False)
request = br.open(url)
br.select_form(name="frmSignIn")
br['IDToken1'] = username # Set the form values
br['IDToken2'] = password # Set the form values
resp = br.submit()
print resp.read()
The output is the same as the source code of the login url; the same was the case with requests and urllib2.
PS: I tried using Selenium, only to find my corporate server won't have Firefox etc. installed.
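A quick way to confirm what's happening is to test whether the page that comes back from `br.submit()` still contains the login form, rather than eyeballing the raw source. A small sketch (Python 3 syntax; `frmSignIn` is the form name from the question):

```python
# Detect whether a response is still the login page by looking for its form.
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    """Collects the name attribute of every <form> tag in a page."""
    def __init__(self):
        super().__init__()
        self.form_names = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_names.append(dict(attrs).get("name"))

def still_on_login_page(html, form_name="frmSignIn"):
    finder = FormFinder()
    finder.feed(html)
    return form_name in finder.form_names

login_page = '<html><form name="frmSignIn"><input name="IDToken1"></form></html>'
landing    = '<html><h1>Welcome</h1></html>'
print(still_on_login_page(login_page))  # -> True
print(still_on_login_page(landing))     # -> False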
I'm learning Python and running my first scraper using mechanize.
Goal: log in to a website, navigate through a list of URLs, return a chunk of text from each, then write all of that to a CSV.
The code I'm using works great, but as I run down the list, after 10-15 rows the entire HTML page comes back rather than just that chunk of text. Once the error happens it won't function correctly until I run the operation again, and then it hits the snag after another 10-15.
After looking at the HTML, it looks as though I'm being logged out, and I can't figure out why. The URLs are all legit; if I test get_Num individually for all of the links it seems to work fine.
Last thing: the site needs JavaScript and cookies to log in.
Here are what the functions look like
def get_Num(link): # takes in a URL, provided by the csv, and finds the chunk of text I'm looking for
    import cookielib
    import mechanize
    # Browser
    br = mechanize.Browser()
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    # Follows refresh 0 but doesn't hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    # The site we will navigate into, handling its session
    br.open('https://www.website.com/login')
    # select the first form
    br.select_form(nr=0)
    # user credentials
    br['session_key'] = login
    br['session_password'] = pw
    # Login
    br.submit()
    print 'logged in'
    # open link
    br.open(link)
    html = br.response().read()
    position1 = html.find('text')
    position2 = html.find('>', position1)
    targetNumber = html[position1:position2]
    return targetNumber
def get_Info(inputFile, outputFile): # takes a CSV, runs get_Num for every url, then writes the whole thing to a csv
    import csv
    with open(inputFile, "rb") as csvinput:
        with open(outputFile, 'w+') as csvoutput:
            reader = csv.reader(csvinput)
            writer = csv.writer(csvoutput)
            for rowNum, line in enumerate(reader):
                vv = get_Num(str(line[1]))
                line.append(vv)
                writer.writerow(line)
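Two observations on the code above. First, get_Num builds a fresh Browser and logs in for every single URL; reusing one logged-in Browser across the whole loop is both faster and less likely to trip the site's session or rate limits, which is a plausible cause of the logout after 10-15 rows. Second, the string slicing is fragile: `html.find('text')` returns -1 when the marker is missing, e.g. when the login page came back instead of the article, and the slice then silently produces garbage. Isolating that logic in a pure function makes it testable offline; a defensive sketch, where 'text' stands in for the real marker:

```python
def extract_between(html, marker, end=">"):
    """Return the text between marker and the next end, or None if absent."""
    start = html.find(marker)
    if start == -1:          # marker missing: probably got logged out
        return None
    start += len(marker)     # slice AFTER the marker, not including it
    stop = html.find(end, start)
    if stop == -1:
        return None
    return html[start:stop]

page = 'junk before text12345> junk after'
print(extract_between(page, 'text'))                      # -> 12345
print(extract_between('<html>login page</html>', 'text')) # -> None
```

A `None` return is then an explicit signal to re-login (or stop), instead of appending a chunk of login-page HTML to the CSV.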
I am trying to log in to Yahoo Mail using mechanize. I followed some tutorials/questions about this, but everything I try fails.
Can someone point me to the right direction?
My code:
#!/usr/bin/python
import re
import sys
import mechanize
from urllib2 import HTTPError
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6)')]
response = br.open("https://login.yahoo.com/config/login_verify2?&.src=ym&.intl=us")
#assert br.viewing_html()
print response.get_data()
br.select_form(nr=0)
br["login"] = "username"
br["passwd"] = "password"
try:
    response = br.submit()
    print response.get_data()
except HTTPError, e:
    sys.exit("post failed: %d: %s" % (e.code, e.msg))
It seems my code fails because the username/password is wrong. I double-checked!