I'm trying to set the controls on a form from disabled readonly to something usable. But the problem is the control has no name. I saw a previous post where someone else solved this problem. But I have been unable to implement their solution correctly. Since I am new to python, I was hoping someone could shed some light on what I'm doing wrong.
Previous post:
Use mechanize to submit form without control name
Here is my code:
import urllib2
from bs4 import BeautifulSoup
import cookielib
import urllib
import requests
import mechanize
# Log-in to wsj.com
cj = mechanize.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.set_handle_robots(False)
# The site we will navigate into, handling it's session
br.open('https://id.wsj.com/access/pages/wsj/us/login_standalone.html?mg=id-wsj')
# Select the first (index zero) form
br.select_form(nr=0)
# Returns None
forms = [f for f in br.forms()]
print forms[0].controls[6].name
# Returns set_value not defined
set_value(value,
name=None, type=None, kind=None, id=None, nr=None,
by_label=False, # by_label is deprecated
label=None)
forms[0].set_value("LOOK!!!! I SET THE VALUE OF THIS UNNAMED CONTROL!",
nr=6)
control.readonly = False
control.disabled = True
The control which value you are trying to set is actually a button, submit button:
print [(control.name, control.type) for control in forms[0].controls]
prints:
[('landing_page', 'hidden'),
('login_realm', 'hidden'),
('login_template', 'hidden'),
('username', 'text'),
('password', 'password'),
('savelogin', 'checkbox'),
(None, 'submit')]
And, you cannot use set_value() for a submit button:
submit_button = forms[0].controls[6]
print submit_button.set_value('test')
results into:
AttributeError: SubmitControl instance has no attribute 'set_value'
Hope that helps.
Related
So I'm doing a relatively simple project so I can teach myself Python. I've come to a point where I'm stuck. So I have a variable named element in pycharm debugger which shows as
This variable is type Tag, which is correct to me. In element I want to see if the class="schedule_dgrd_time/result"which is not the case in the above image.
I see that within element there is an attrs.
How can I access that value? If I do element.string I get the text value which in this case would be Sat.(...I could make that work), but I was wondering if I can check the class attribute value first.
I've been searching for this for a couple days now and just can't get it. I've googled myself to death at this point. Any help or pointers would be greatly appreciated. Thanks for reading.
Update
Here is my code
import urllib2
import datetime
import re
from bs4 import BeautifulSoup
# today's date
date = datetime.datetime.today().strftime('%-m/%d/%Y')
validDay = "Mon\.|Tue\.|Wed\.|Thu(r)?(s)?\.|Fri\."
website = "http://www.texassports.com/schedule.aspx?path=baseball"
opener = urllib2.build_opener()
##add headers that make it look like I'm a browser
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
page = opener.open(website)
# turn page into html object
soup = BeautifulSoup(page, 'html.parser')
#print soup.prettify()
#get all home games
all_rows = soup.find_all('tr', class_='schedule_home_tr')
# see if any game is today
# entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('.*({}).*'.format(date)))]
# hard coding for testing weekend
entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('3/11/2017'))]
time = "schedule_dgrd_time/result"
for elements in entryForToday:
for element in elements:
#this is where I'm stuck.
# if element.attrs:
# print element.attrs['class'][0]
I know a double nested for loop is not ideal so if you have a better way I'm glad to hear it. Thanks
So I was able to figure out. I have some NavigableString which doesn't have attrs so that was throwing an error. element.attrs['class'][0] does work now. I had to check if isinstanceOf a tag, if not it would skip it. Anywho, my code is below for anyone that is interested.
import urllib2
import datetime
import re
from bs4 import BeautifulSoup
from bs4 import Tag
# today's date
date = datetime.datetime.today().strftime('%-m/%d/%Y')
validDay = "Mon\.|Tue\.|Wed\.|Thu(r)?(s)?\.|Fri\."
website = "http://www.texassports.com/schedule.aspx?path=baseball"
opener = urllib2.build_opener()
##add headers that make it look like I'm a browser
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
page = opener.open(website)
# turn page into html object
soup = BeautifulSoup(page, 'html.parser')
#print soup.prettify()
#get all home games
all_rows = soup.find_all('tr', class_='schedule_home_tr')
# see if any game is today
# entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('.*({}).*'.format(date)))]
# hard coding for testing weekend
entryForToday = [t for t in all_rows if t.findAll('nobr',text=re.compile('3/14/2017'))]
classForTime = "schedule_dgrd_time/result"
timeOfGame = "none";
if entryForToday:
entryForToday = [t for t in entryForToday if t.findAll('td',
class_='schedule_dgrd_game_day_of_week',
text=re.compile('.*({}).*'.format(validDay)))]
if entryForToday:
for elements in entryForToday:
for element in elements:
if isinstance(element, Tag):
if element.attrs['class'][0] == classForTime:
timeOfGame = element.text
# print element.text
break
print timeOfGame
I try to find a way to scrape and parse more pages in the signed in area.
These example links accesible from signed in I want to parse.
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create spider that can open each one of these links and then parse them.
I have created another spider which can open and parse only one of them.
When I tried to create "for" or "while" to call more requests with other links it allowed me not because I cannot put more returns into generator, it returns error. I also tried link extractors, but it didn't work for me.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor
class Spider3(Spider):
name = "Spider3"
allowed_domains = ["example.com"]
start_urls = ["http://example.com/login"] #this link lead to login page
When I am signed in it returns page with url, that contains "stat", that is why I put here first "if" condition.
When I am signed in, I request one link and call function parse_items.
def parse(self, response):
#when "stat" is in url it means that I just signed in
if "stat" in response.url:
return Request("http://example.com/seller/demand/?id=305554", callback = self.parse_items)
else:
#this succesful login turns me to page, it's url contains "stat"
return [FormRequest.from_response(response,
formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login', 'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},callback=self.parse)]
Function parse_items simply parse desired content from one desired page:
def parse_items(self,response):
questions = Selector(response).xpath('//*[#id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
for question in questions:
item = StackItem()
item['name'] = question.xpath('th/text()').extract()[0]
item['value'] = question.xpath('td/text()').extract()[0]
yield item
Can you help me please to update this code to open and parse more than one page in each sessions?
I don't want to sign in over and over for each request.
The session most likely depends on the cookies and scrapy manages that by itself. I.e:
def parse_items(self,response):
questions = Selector(response).xpath('//*[#id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
for question in questions:
item = StackItem()
item['name'] = question.xpath('th/text()').extract()[0]
item['value'] = question.xpath('td/text()').extract()[0]
yield item
next_url = '' # find url to next page in the current page
if next_url:
yield Request(next_url, self.parse_items)
# scrapy will retain the session for the next page if it's managed by cookies
I am currently working on the same problem. I use InitSpider so I can overwrite __init__ and init_request. The first is just for initialisation of custom stuff and the actual magic happens in my init_request:
def init_request(self):
"""This function is called before crawling starts."""
# Do not start a request on error,
# simply return nothing and quit scrapy
if self.abort:
return
# Do a login
if self.login_required:
# Start with login first
return Request(url=self.login_page, callback=self.login)
else:
# Start with pase function
return Request(url=self.base_url, callback=self.parse)
My login looks like this
def login(self, response):
"""Generate a login request."""
self.log('Login called')
return FormRequest.from_response(
response,
formdata=self.login_data,
method=self.login_method,
callback=self.check_login_response
)
self.login_data is a dict with post values.
I am still a beginner with python and scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on github.
HTH:
https://github.com/cytopia/crawlpy
I am able to scrape all the stories from the first page,my problem is how to move to the next page and continue scraping stories and name,kindly check my code below
# -*- coding: utf-8 -*-
import scrapy
from cancerstories.items import CancerstoriesItem
class MyItem(scrapy.Item):
name = scrapy.Field()
story = scrapy.Field()
class MySpider(scrapy.Spider):
name = 'cancerstories'
allowed_domains = ['thebreastcancersite.greatergood.com']
start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']
def parse(self, response):
rows = response.xpath('//a[contains(#href,"story")]')
#loop over all links to stories
for row in rows:
myItem = MyItem() # Create a new item
myItem['name'] = row.xpath('./text()').extract() # assign name from link
story_url = response.urljoin(row.xpath('./#href').extract()[0]) # extract url from link
request = scrapy.Request(url = story_url, callback = self.parse_detail) # create request for detail page with story
request.meta['myItem'] = myItem # pass the item with the request
yield request
def parse_detail(self, response):
myItem = response.meta['myItem'] # extract the item (with the name) from the response
#myItem['name']=response.xpath('//h1[#class="headline"]/text()').extract()
text_raw = response.xpath('//div[#class="photoStoryBox"]/div/p/text()').extract() # extract the story (text)
myItem['story'] = ' '.join(map(unicode.strip, text_raw)) # clean up the text and assign to item
yield myItem # return the item
You could change your scrapy.Spider for a CrawlSpider, and use Rule and LinkExtractor to follow the link to the next page.
For this approach you have to include the code below:
...
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
...
rules = (
Rule(LinkExtractor(allow='\.\./stories;jsessionid=[0-9A-Z]+?page=[0-9]+')),
)
...
class MySpider(CrawlSpider):
...
This way, for each page you visit the spider will create a request for the next page (if present), follow it when finishes the execution for the parse method, and repeat the process again.
EDIT:
The rule I wrote is just to follow the next page link not to extract the stories, if your first approach works it's not necessary to change it.
Also, regarding the rule in your comment, SgmlLinkExtractor is deprecated so I recommend you to use the default link extractor, and the rule itself is not well defined.
When the parameter attrs in the extractor is not defined, it searchs links looking for the href tags in the body, which in this case looks like ../story/mother-of-4435 and not /clickToGive/bcs/story/mother-of-4435. That's the reason it doesn't find any link to follow.
you can follow next pages manually if you would use scrapy.spider class,example:
next_page = response.css('a.pageLink ::attr(href)').extract_first()
if next_page:
absolute_next_page_url = response.urljoin(next_page)
yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)
Do not forget to rename your parse method to parse_start_url if you want to use CralwSpider class
I am using PyQt4 for creating a custom browser using QtWebKit, but I am stuck on saving bookmarks from the browser. Does anyone know how to achieve that?
You're a little vague on how you want this done, so I'll say we wanted to use a button imported from a UI file called bookmarks_Btn. You'll need to use the pickle module.
Here's the example code...
from PyQt4 import QtCore, QtGui, QtWebKit, uic
import pickle
class window(QtGui.QWidget):
def __init__(self, parent=None):
super(httpWidget, self).__init__(parent)
self.ui = uic.loadUi('mybrowser.ui')
self.ui.setupUi(self)
def bookmarksLoad(self):
print 'Loading bookmarks'
try:
bookOpen = open('bookmarks.txt', 'rb')
bookmarks = pickle.load(bookOpen)
bookOpen.close()
print bookmarks # Not necessary, but for example purposes
# Here you decide how "bookmarks" variable is displayed.
except:
bookOpen = open('bookmarks.txt', 'wb')
bookmarks = 'http://www.stackoverflow.com'
bookWrite = pickle.dump(bookmarks, bookOpen)
bookOpen.close()
print bookmarks # Not necessary, but for example purposes
# Here you decide how "bookmarks" variable is displayed.
QtCore.QObject.connect(self.ui.bookmarks_Btn, QtCore.SIGNAL('clicked()'), self.bookmarksLoad)
self.ui.show()
def bookmarks():
url = input 'Enter a URL: '
bookOpen = open('bookmarks.txt', 'wb')
bookOpen.write(url)
bookOpen.close()
print 'Website bookmarked!'
if __name__ == '__main__':
app = QtGui.QApplication(sys.argv)
run = window()
bookmarks()
sys.exit(app.exec_())
# You add on here, for example, deleting bookmarks.
However, if you wanted it to be retrieved from an address bar (named address, make the following changes...
# In the bookmarks function...
global url # Add at beginning
# Remove the input line.
# Add at end of __init__ in window class:
url = self.ui.address.text()
global url
That's pretty much the basics. Please note I normally program in Python 3 and PyQt5 so if there are any errors let me know :)
Hi all reportlab master,
I've search the web and also here in stackoverflow but can't find a similar situation on my problem I'm trying to solve during this Holiday vacation.
In django admin, I'm trying to create an action to view my database to a specific format. If I select one record I can view the report in one page pdf. Which is fine. In case the user try to more record that's where the problem start. For example I select multiple record, I can view the report but all content is still in one page pdf.
Is there a way to show a record per page in pdf? All reportlab master jedi, Please help me how to do this the right way.
Here's my code on what I did.
from django.contrib import admin
from models import LatestRsl
from io import BytesIO
from reportlab.pdfgen import canvas
from django.http import HttpResponse
try:
from cStringIO import StringIO
except ImportError:
from StringIO import StringIO
from reportlab.lib.units import inch
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import BaseDocTemplate, PageTemplate, Paragraph, Frame
from reportlab.lib.pagesizes import letter
def go(modeladmin, request, queryset):
response = HttpResponse(mimetype='application/pdf')
response['Content-Disposition'] = 'filename = testframe.pdf'
buffer = StringIO()
c = canvas.Canvas(buffer)
doc = BaseDocTemplate(buffer, showBoundary=1, leftMargin= 0.1*inch, rightMargin= 0.1*inch,
topMargin= 0.1*inch, bottomMargin= 0.1*inch)
signfr = Frame(5.1*inch, 1.2*inch, 2.8*inch, 0.44*inch, showBoundary=1)
modelfr = Frame(3.6*inch, 4.6*inch, 2.8*inch, 0.44*inch, showBoundary=1)
doc.addPageTemplates([PageTemplate(id= 'rsl_frame', frames=[signfr, modelfr]),
PageTemplate(id= 'rsl_frame2', frames=[signfr, modelfr])])
story = []
styles=getSampleStyleSheet()
styles.add(ParagraphStyle(name='Verdana9', fontName= 'Verdana', fontSize= 9))
styles.add(ParagraphStyle(name='VerdanaB10', fontName= 'VerdanaB', fontSize= 10))
for obj in queryset:
#1st frame
model = Paragraph(obj.make,styles["Verdana9"])
story.append(model)
modelfr.addFromList(story,c)
#2nd frame
signatory = Paragraph(obj.signatory,styles["VerdanaB10"])
story.append(signatory)
signfr.addFromList(story,c)
doc.build(story)
c.showPage()
c.save()
pdf = buffer.getvalue()
buffer.close()
response.write(pdf)
return response
Assuming your queryset variable contains all the records you need, you could insert a PageBreak object. Just add from reportlab.platypus import PageBreak to the top of your file, then append a PageBreak object to your document's elements.
If you want to change the template for each page, you can also append a NextPageTemplate and pass the id of your PageTemplate. You'll need to add from reportlab.platypus import NextPageTemplate to the top of your file as well.
for obj in queryset:
#1st frame
model = Paragraph(obj.make,styles["Verdana9"])
story.append(model)
modelfr.addFromList(story,c)
#2nd frame
signatory = Paragraph(obj.signatory,styles["VerdanaB10"])
story.append(signatory)
signfr.addFromList(story,c)
# Force the report to use a different PageTemplate on the next page
story.append(NextPageTemplate('rsl_frame2'))
# Start a new page for the next object in the query
story.append(PageBreak())
You could move the PageBreak wherever you need it, but it's a simple "function" flowable. NextPageTemplate can take the id of any valid PageTemplate object that you've added via addPageTemplates.