Interacting with websites using selenium - python-2.7

I am trying to interact with websites using the package "selenium". I have a problem understanding what this line is doing:
elem = driver.find_element_by_name("q")
The line before that checks that the site's title contains the word "Python". Then this line somehow finds the search text box on the webpage using the string "q". The package documentation skips over this point; what am I missing?
Full code:
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class PythonOrgSearch(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        elem = driver.find_element_by_name("q")
        elem.send_keys("pycon")
        assert "No results found." not in driver.page_source
        elem.send_keys(Keys.RETURN)

    def tearDown(self):
        self.driver.close()

if __name__ == "__main__":
    unittest.main()
So far I can see that I can find certain elements using:
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
But why does "q" specifically point to the search box on the python website?

If I use the browser's developer tools to inspect the element that serves as the search box at the top of the www.python.org page, this is what I see:
<input id="id-search-field" name="q" role="textbox" class="search-field placeholder" placeholder="Search" tabindex="1" type="search">
Note the attribute name="q". This element is named q so driver.find_element_by_name("q") finds it.
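For illustration only (based on the markup shown above, which python.org may change at any time), the same element can be reached through several of the locator methods listed in the question:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.python.org")

# All of these point at the same <input> element, just via different attributes.
by_name = driver.find_element_by_name("q")                        # name="q"
by_id = driver.find_element_by_id("id-search-field")              # id="id-search-field"
by_css = driver.find_element_by_css_selector("input[name='q']")   # CSS attribute selector

driver.quit()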

How to scrape pages after login

I am trying to find a way to scrape and parse more pages in the signed-in area. I want to parse these example links, which are only accessible after signing in:
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create a spider that can open each of these links and parse them. I have created another spider which can open and parse only one of them. When I tried a "for" or "while" loop to issue more requests with the other links, it did not let me, because I cannot put more than one return into a generator; it raises an error. I also tried link extractors, but they didn't work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor

class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]  # this link leads to the login page
When I am signed in, the site returns a page whose URL contains "stat"; that is why I put the first "if" condition there. Once signed in, I request one link and call the function parse_items.
def parse(self, response):
    # when "stat" is in the url it means that I just signed in
    if "stat" in response.url:
        return Request("http://example.com/seller/demand/?id=305554", callback=self.parse_items)
    else:
        # a successful login redirects me to a page whose url contains "stat"
        return [FormRequest.from_response(response,
            formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                      'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
            callback=self.parse)]
The function parse_items simply parses the desired content from one page:
def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
Can you please help me update this code so it opens and parses more than one page per session? I don't want to sign in over and over for each request.
The session most likely depends on cookies, and Scrapy manages those by itself, i.e.:
def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
    next_url = ''  # find the url of the next page in the current page
    if next_url:
        yield Request(next_url, self.parse_items)
        # scrapy will retain the session for the next page if it's managed by cookies
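Since the question is specifically about issuing several requests after a single login, here is a minimal sketch of how parse() could yield one request per ID instead of returning just one; the URL pattern and IDs are the ones listed in the question, the rest is an assumption about the asker's spider:
def parse(self, response):
    if "stat" in response.url:
        # already signed in: yield one request per demand page,
        # all of them reusing the same session cookies
        for demand_id in [305554, 305553, 305552]:  # extend with the real list of ids
            yield Request(
                "http://example.com/seller/demand/?id=%d" % demand_id,
                callback=self.parse_items,
            )
    else:
        yield FormRequest.from_response(
            response,
            formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                      'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
            callback=self.parse,
        )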
I am currently working on the same problem. I use InitSpider so I can override __init__ and init_request. The first is just for initialisation of custom stuff, and the actual magic happens in my init_request:
def init_request(self):
    """This function is called before crawling starts."""
    # Do not start a request on error,
    # simply return nothing and quit scrapy
    if self.abort:
        return
    # Do a login
    if self.login_required:
        # Start with the login first
        return Request(url=self.login_page, callback=self.login)
    else:
        # Start with the parse function
        return Request(url=self.base_url, callback=self.parse)
My login looks like this:
def login(self, response):
    """Generate a login request."""
    self.log('Login called')
    return FormRequest.from_response(
        response,
        formdata=self.login_data,
        method=self.login_method,
        callback=self.check_login_response
    )
self.login_data is a dict with post values.
I am still a beginner with python and scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on github.
HTH:
https://github.com/cytopia/crawlpy
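For context, here is a rough skeleton of how those pieces could fit into an InitSpider subclass; the attribute names are the ones used in the snippets above, while the class name, the login_data field names, and the body of check_login_response are placeholders for illustration (the full version lives in the linked repo):
from scrapy.spiders.init import InitSpider
from scrapy.http import Request, FormRequest

class LoginSpider(InitSpider):
    name = 'loginspider'
    login_page = 'http://example.com/login'
    base_url = 'http://example.com/'
    start_urls = ['http://example.com/']
    login_required = True
    login_method = 'POST'
    login_data = {'username': 'my_login', 'password': 'my_password'}
    abort = False

    # init_request() and login() exactly as shown above ...

    def check_login_response(self, response):
        """Sketch: decide whether the login worked, then resume the normal crawl."""
        if 'logout' in response.body.lower():
            return self.initialized()  # InitSpider hook that resumes crawling
        self.log('Login failed')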

Selenium/Python Finding Element and Clicking it

I have been researching this for a while, and here is the code I wrote:
driver = webdriver.Firefox()
time.sleep(10)
get("some website")
time.sleep(10)
x = driver.find_element_by_id("vB_Editor_QR_textarea")
x.click()
It keeps giving me an error; the part that's not working is the find_element and click() calls. The error comes from webdriver.py.
Here is a screenshot of the error (note: I don't have a mouse at the moment, so I just took a picture):
https://gyazo.com/bc6f8d3e77f2e9d9b5bcbfe202b73258
You should try this:
driver = webdriver.Firefox()
driver.get("https://example.com") # Make sure you use double quotes
And instead of time.sleep() you should use implicit and explicit waits. I usually use an implicit wait.
driver.implicitly_wait(10) # 10 seconds
An implicit wait tells WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements that are not immediately available.
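If you want to wait for one specific element rather than slowing down every lookup, an explicit wait is the other option; a minimal sketch using the textarea id from your own code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to become clickable, then click it.
element = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "vB_Editor_QR_textarea"))
)
element.click()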
Try this simple google search automation:
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

class AutoTest(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()

    def test_auto_test(self):
        driver = self.driver
        driver.get("http://www.google.com")
        element = driver.find_element_by_css_selector('#lst-ib')
        element.send_keys("StackOverflow")
        element = driver.find_element_by_css_selector('#sblsbb > button > span').click()

if __name__ == "__main__":
    unittest.main()

webscraping an .ASPX site with Selenium and/or Scrapy

I am new to Python/Selenium and wrote the following code in Python on Windows to scrape the demographic info for the 5,484 physicians listed on the MA Board of Registration website.
My issue: the website is .aspx, so I initially chose Selenium. However, I would really appreciate any insights/recommendations on coding the next steps (see below). More specifically: is it more efficient to continue with Selenium or to incorporate Scrapy? Any insights are greatly appreciated! The steps are:
Select each physician's hyperlink (1-10 per page) by clicking each hyperlinked "PhysicianProfile.aspx?PhysicianID=XXXX" on the "ChooseAPhysician" page.
Follow each link and extract the "Demographic info".
Demographic info: "phy_name", "lic_issue_date", "prim_worksetting", etc.
Return to the "ChooseAPhysician" page and click "Next".
Repeat for the remaining 5,474 physicians.
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get('http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1')
#Locate the elements
zip = driver.find_element_by_xpath("//*[@id=\"ctl00_ContentPlaceHolder1_txtZip\"]")
select = Select(driver.find_element_by_xpath("//select[@id=\"ctl00_ContentPlaceHolder1_cmbDistance\"]"))
print select.options
print [o.text for o in select.options]
select.select_by_visible_text("15")
prim_care_chekbox = driver.find_element_by_xpath("//*[@id=\"ctl00_ContentPlaceHolder1_SpecialtyGroupsCheckbox_6\"]")
find_phy_button = driver.find_element_by_xpath("//*[@id=\"ctl00_ContentPlaceHolder1_btnSearch\"]")
#Input zipcode, check "primary care box", and click "find phy" button
zip.send_keys("02109")
prim_care_chekbox.click()
find_phy_button.click()
#wait for "ChooseAPhysician" page to open
wait = WebDriverWait(driver, 10)
open_phy_bio = driver.find_element_by_xpath("//*[@id=\"PhysicianSearchResultGrid\"]/tbody/tr[2]/td[1]/a")
element = wait.until(EC.element_to_be_selected(open_phy_bio))
open_phy_bio.click()
links = self.driver.find_element_by_xpath("//*[@id=\"PhysicianSearchResultGrid\"]/tbody/tr[2]/td[1]/a")
for link in links:
    link = link.get_attribute("href")
    self.driver.get(link)
def parse(self, response):
    item = SummaryItem()
    sel = self.selenium
    sel.open(response.url)
    time.sleep(4)
    item["phy_name"] = driver.find_elements_by_xpaths("//*[@id=\"content\"]/center/p[1]").extract()
    item["lic_status"] = driver.find_elements_by_xpaths("//*[@id=\"content\"]/center/table[2]/tbody/tr[3]/td/table/tbody/tr/td[1]/table/tbody/tr[2]/td[2]/a[1]").extract()
    item["lic_issue_date"] = driver.find.elements_by_xpaths("//*[@id=\"content\"]/center/table[2]/tbody/tr[3]/td/table/tbody/tr/td[1]/table/tbody/tr[3]/td[2]").extract()
    item["prim_worksetting"] = driver.find.elements_by_xpaths("//*[@id=\"content\"]/center/table[2]/tbody/tr[3]/td/table/tbody/tr/td[1]/table/tbody/tr[5]/td[2]").extract()
    item["npi"] = driver.find_elements_by_xpaths("//*[@id=\"content\"]/center/table[2]/tbody/tr[3]/td/table/tbody/tr/td[2]/table/tbody/tr[6]/td[2]").extract()
    item["Med_sch_grad_date"] = driver.find_elements_by_xpaths("//*[@id=\"content\"]/center/table[3]/tbody/tr[3]/td/table/tbody/tr[2]/td[2]").extract()
    item["Area_of_speciality"] = driver.find_elements_by_xpaths("//*[@id=\"content\"]/center/table[4]/tbody/tr[3]/td/table/tbody/tr/td[2]").extract()
    item["link"] = driver.find_element_by_xpath("//*[@id=\"PhysicianSearchResultGrid\"]/tbody/tr[2]/td[1]/a").extract()
    return item
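Not a full answer to the Selenium-vs-Scrapy question, but as a rough, Selenium-only sketch of the loop the steps above describe (click each profile link, extract, go back, page forward). The result-grid and name XPaths are taken from the question; the generalised row XPath, the use of driver.back(), and the "Next" link lookup are assumptions that would need checking against the real page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1')
wait = WebDriverWait(driver, 10)

while True:
    # wait for the result grid, then count the profile links on the current page
    rows = wait.until(EC.presence_of_all_elements_located(
        (By.XPATH, '//*[@id="PhysicianSearchResultGrid"]/tbody/tr/td[1]/a')))
    for i in range(len(rows)):
        # re-find the links after every driver.back(); the old references go stale
        links = driver.find_elements_by_xpath('//*[@id="PhysicianSearchResultGrid"]/tbody/tr/td[1]/a')
        links[i].click()
        phy_name = driver.find_element_by_xpath('//*[@id="content"]/center/p[1]').text
        # ... extract the other demographic fields the same way ...
        driver.back()
    # move to the next results page; stop when there is no "Next" link
    next_links = driver.find_elements_by_link_text('Next')
    if not next_links:
        break
    next_links[0].click()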

How to add url to bookmark?

I am using PyQt4 for creating a custom browser using QtWebKit, but I am stuck on saving bookmarks from the browser. Does anyone know how to achieve that?
You're a little vague on how you want this done, so I'll assume we want to use a button called bookmarks_Btn imported from a UI file. You'll need to use the pickle module.
Here's the example code...
import sys
import pickle
from PyQt4 import QtCore, QtGui, QtWebKit, uic

class window(QtGui.QWidget):
    def __init__(self, parent=None):
        super(window, self).__init__(parent)
        self.ui = uic.loadUi('mybrowser.ui')
        self.ui.setupUi(self)
        QtCore.QObject.connect(self.ui.bookmarks_Btn, QtCore.SIGNAL('clicked()'), self.bookmarksLoad)
        self.ui.show()

    def bookmarksLoad(self):
        print 'Loading bookmarks'
        try:
            bookOpen = open('bookmarks.txt', 'rb')
            bookmarks = pickle.load(bookOpen)
            bookOpen.close()
            print bookmarks  # Not necessary, but for example purposes
            # Here you decide how the "bookmarks" variable is displayed.
        except:
            bookOpen = open('bookmarks.txt', 'wb')
            bookmarks = 'http://www.stackoverflow.com'
            pickle.dump(bookmarks, bookOpen)
            bookOpen.close()
            print bookmarks  # Not necessary, but for example purposes
            # Here you decide how the "bookmarks" variable is displayed.

def bookmarks():
    url = raw_input('Enter a URL: ')
    bookOpen = open('bookmarks.txt', 'wb')
    pickle.dump(url, bookOpen)  # store with pickle so bookmarksLoad can read it back
    bookOpen.close()
    print 'Website bookmarked!'

if __name__ == '__main__':
    app = QtGui.QApplication(sys.argv)
    run = window()
    bookmarks()
    sys.exit(app.exec_())
    # You add on here, for example, deleting bookmarks.
However, if you want the URL to be retrieved from an address bar widget (named address), make the following changes...
# In the bookmarks function...
global url  # add at the beginning, and remove the raw_input line
# At the end of __init__ in the window class, add:
global url
url = self.ui.address.text()
That's pretty much the basics. Please note I normally program in Python 3 and PyQt5 so if there are any errors let me know :)
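As a variant (not part of the answer above): the global variable can be avoided by bookmarking from a method on the window class instead; a minimal sketch, assuming the .ui file exposes an address line edit and a bookmark_Btn button:
    # inside __init__:
    # QtCore.QObject.connect(self.ui.bookmark_Btn, QtCore.SIGNAL('clicked()'), self.bookmarkCurrent)

    def bookmarkCurrent(self):
        """Append the address bar's URL to the pickled bookmark list."""
        url = str(self.ui.address.text())
        try:
            with open('bookmarks.txt', 'rb') as f:
                bookmarks = pickle.load(f)
        except (IOError, EOFError):
            bookmarks = []
        if not isinstance(bookmarks, list):
            bookmarks = [bookmarks]
        bookmarks.append(url)
        with open('bookmarks.txt', 'wb') as f:
            pickle.dump(bookmarks, f)
        print 'Website bookmarked!'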

passing commandline arguments to a selenium python webdriver test case

The following code is written using the Selenium Python WebDriver and runs on Sauce Labs. I am providing the browser name, version, and platform in a list; how do I do the same by providing the browser details through command-line arguments? I am using py.test to execute the test cases.
import os
import sys
import httplib
import base64
import json
import new
import unittest
import sauceclient
from selenium import webdriver
from sauceclient import SauceClient
# it's best to remove the hardcoded defaults and always get these values
# from environment variables
USERNAME = os.environ.get('SAUCE_USERNAME', "ranjanprabhub")
ACCESS_KEY = os.environ.get('SAUCE_ACCESS_KEY', "ecec4dd0-d8da-49b9-b719-17e2c43d0165")
sauce = SauceClient(USERNAME, ACCESS_KEY)
browsers = [{"platform": "Mac OS X 10.9",
"browserName": "chrome",
"version": ""},
]
def on_platforms(platforms):
    def decorator(base_class):
        module = sys.modules[base_class.__module__].__dict__
        for i, platform in enumerate(platforms):
            d = dict(base_class.__dict__)
            d['desired_capabilities'] = platform
            name = "%s_%s" % (base_class.__name__, i + 1)
            module[name] = new.classobj(name, (base_class,), d)
    return decorator

@on_platforms(browsers)
class SauceSampleTest(unittest.TestCase):
    def setUp(self):
        self.desired_capabilities['name'] = self.id()
        sauce_url = "http://%s:%s@ondemand.saucelabs.com:80/wd/hub"
        self.driver = webdriver.Remote(
            desired_capabilities=self.desired_capabilities,
            command_executor=sauce_url % (USERNAME, ACCESS_KEY)
        )
        self.driver.implicitly_wait(30)

    def test_sauce(self):
        self.driver.get('http://saucelabs.com/test/guinea-pig')
        assert "I am a page title - Sauce Labs" in self.driver.title
        comments = self.driver.find_element_by_id('comments')
        comments.send_keys('Hello! I am some example comments.'
                           ' I should be in the page after submitting the form')
        self.driver.find_element_by_id('submit').click()
        commented = self.driver.find_element_by_id('your_comments')
        assert ('Your comments: Hello! I am some example comments.'
                ' I should be in the page after submitting the form'
                in commented.text)
        body = self.driver.find_element_by_xpath('//body')
        assert 'I am some other page content' not in body.text
        self.driver.find_elements_by_link_text('i am a link')[0].click()
        body = self.driver.find_element_by_xpath('//body')
        assert 'I am some other page content' in body.text

    def tearDown(self):
        print("Link to your job: https://saucelabs.com/jobs/%s" % self.driver.session_id)
        try:
            if sys.exc_info() == (None, None, None):
                sauce.jobs.update_job(self.driver.session_id, passed=True)
            else:
                sauce.jobs.update_job(self.driver.session_id, passed=False)
        finally:
            self.driver.quit()
So this is a bit complicated because you can pass an array of browsers into the @on_platforms decorator. My solution will only work for a single browser, as it looks like that's what you're doing right now.
For the current, single browser, situation -- you're looking for argparse. Here's my suggested fix:
import argparse

def setup_parser():
    parser = argparse.ArgumentParser(description='Automation Testing!')
    parser.add_argument('-p', '--platform', help='Platform for desired_caps', default='Mac OS X 10.9')
    parser.add_argument('-b', '--browser-name', help='Browser Name for desired_caps', default='chrome')
    parser.add_argument('-v', '--version', default='')
    args = vars(parser.parse_args())
    return args

desired_caps = setup_parser()
browsers = [desired_caps]
print browsers
But if you're looking to test multiple browsers (which I suggest you do!), you should not try and use command line arguments for the desired_caps of each individual browser. You should instead load a json config file for the browsers and the desired_caps for each one that you want Sauce to run.
Maybe have a different config file for each set of browsers, and then use command line arguments to pass in the config files you want to load.
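A minimal sketch of that idea, assuming a browsers.json file and a --config flag (both names are made up for illustration) whose contents mirror the browsers list above:
import argparse
import json

def load_browsers():
    """Read the list of desired_caps dicts from a JSON config file."""
    parser = argparse.ArgumentParser(description='Automation Testing!')
    parser.add_argument('-c', '--config', default='browsers.json',
                        help='path to a JSON file listing the browsers to run on Sauce')
    args = parser.parse_args()
    # browsers.json would hold the same structure as the browsers list above, e.g.
    # [{"platform": "Mac OS X 10.9", "browserName": "chrome", "version": ""},
    #  {"platform": "Windows 7", "browserName": "firefox", "version": ""}]
    with open(args.config) as f:
        return json.load(f)

browsers = load_browsers()
# then decorate the test class with @on_platforms(browsers) exactly as before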