If Statement Not Working For Scraper - if-statement

I'm hoping you can show me where I'm going wrong with my web scraper.
What I would like is to be notified when a certain string ("Sorry, Gruen Fan") changes on a page. I'm able to pull in the string, but the "if" statement doesn't seem to work; its output should be "Text is in". Here's the code:
from bs4 import BeautifulSoup
from urllib import urlopen
import re

urls = ["http://www.abc.net.au/tv/programs/gruen-nation/"]

for url in urls:
    webpage = urlopen(url).read()
    FindTitle = re.compile('\t\t\t\t(.*)\.<BR><BR>')
    FindTitle = re.findall(FindTitle, webpage)
    print FindTitle[0]
    print ' '
    if 'Sorry, Gruen fan' in FindTitle:
        print("Text is in")
    else:
        print("Text isn't in")
Thanks in advance for your time,
Sam.

FindTitle is a list, and the string isn't an element of that list, so the membership test gives False.
You should check whether it's in the string inside the list instead:
    if 'Sorry, Gruen fan' in FindTitle[0]:
Also, you don't need a regex if you just want to check for a string:
from urllib import urlopen

urls = ["http://www.abc.net.au/tv/programs/gruen-nation/"]

for url in urls:
    html = urlopen(url).read()
    if 'Sorry, Gruen fan' in html:
        print("Text is in")
    else:
        print("Text isn't in")

Related

BeautifulSoup web table scraping

from urllib2 import urlopen, Request
from bs4 import BeautifulSoup

site = 'https://racing.hkjc.com/racing/information/English/racing/LocalResults.aspx/'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(site, headers=hdr)
res = urlopen(req)
rawpage = res.read()
# strip the malformed comment markers so the parser can see the table
page = rawpage.replace("<!-->", "")
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", {"class": "f_tac table_bd draggable"})
print(table)
This works perfectly and prints the table, but when I change the URL to the next page there is nothing to output (None):
'https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2020/03/14&Racecourse=ST&RaceNo=2'
Please help: what's wrong with the URL or the code?
You must add the query string to the end of the URL.
For example, to fetch the table for race 2:
site = 'https://racing.hkjc.com/racing/information/English/racing/LocalResults.aspx/?RaceDate=2020/03/14&Racecourse=ST&RaceNo=2'
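To fetch several races in one go, a minimal sketch (assuming the same date and racecourse parameters shown above, and reusing the code from the question) could build the query string per race:

from urllib2 import urlopen, Request
from bs4 import BeautifulSoup

base = 'https://racing.hkjc.com/racing/information/English/racing/LocalResults.aspx/'
hdr = {'User-Agent': 'Mozilla/5.0'}

for race_no in range(1, 4):  # races 1 to 3
    site = base + '?RaceDate=2020/03/14&Racecourse=ST&RaceNo=%d' % race_no
    page = urlopen(Request(site, headers=hdr)).read().replace("<!-->", "")
    soup = BeautifulSoup(page, "html.parser")
    table = soup.find("table", {"class": "f_tac table_bd draggable"})
    print(table is not None)  # confirm each page actually contains the table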

How to send Email with html page in Django?

I'm new to Django and I don't know how to send email from it. I looked at the Django documentation but it didn't help me. I need to send an email containing an HTML page to different users. In models.py I have two fields, Name and Email. When I click a button, the HTML page should be sent to the appropriate user's Email.
There are a lot of different ways to send emails in Django.
You could even fall back on PHP or another scripting language if you find the pure Python/Django approach complicated, but it's usually unnecessary.
Here is an example of an email utility from a custom email subscription:
email_utility.py:

import logging, traceback
from django.urls import reverse
import requests
from django.template.loader import get_template
from django.utils.html import strip_tags
from django.conf import settings

def send_email(data):
    try:
        url = "https://api.mailgun.net/v3/<domain-name>/messages"
        status = requests.post(
            url,
            auth=("api", settings.MAILGUN_API_KEY),
            data={"from": "YOUR NAME <admin@domain-name>",
                  "to": [data["email"]],
                  "subject": data["subject"],
                  "text": data["plain_text"],
                  "html": data["html_text"]}
        )
        logging.getLogger("info").info("Mail sent to " + data["email"] + ". status: " + str(status))
        return status
    except Exception as e:
        logging.getLogger("error").error(traceback.format_exc())
        return False
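A call to this helper might look like the following; the keys match what send_email reads from data, and the values are only illustrative:

send_email({
    "email": "user@example.com",
    "subject": "Please confirm your subscription",
    "plain_text": "Click the link below to confirm.",
    "html_text": "<p>Click the link below to confirm.</p>",
})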
Don't forget to create a token, which we will verify when the user clicks the confirmation link. The token is encrypted so that no one can tamper with the data:
token = encrypt(email + constants.SEPARATOR + str(time.time()))
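The encrypt helper and constants module above aren't shown in the answer; a minimal sketch that uses Django's built-in signing framework instead (an assumption on my part, not the original author's implementation) could look like this:

from django.core import signing

def make_confirmation_token(email):
    # signed with SECRET_KEY, so any tampering invalidates the token
    return signing.dumps({"email": email})

def read_confirmation_token(token, max_age=60 * 60 * 24):
    # raises signing.BadSignature / SignatureExpired for invalid or expired tokens
    return signing.loads(token, max_age=max_age)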
Here is a naive example leveraging Django's send_mail:
import smtplib
from django.core.mail import send_mail
from django.utils.html import strip_tags
from django.template.loader import render_to_string

# users will be a queryset, e.g.:
users = User.objects.all()  # or a more specific query
subject = 'Subject'
from_email = 'from@xxx.com'

def send_email_to_users(users, subject, from_email):
    full_traceback = []
    for user in users:
        to = [user.email]  # list of people you want to send mail to
        # render with a dynamic context you can retrieve in the html file
        html_content = render_to_string('mail_template.html', {'title': 'My Awesome email title', 'content': 'Some email content', 'username': user.username})
        traceback = {}
        try:
            send_mail(subject, strip_tags(html_content), from_email, to, html_message=html_content, fail_silently=False)
            traceback['status'] = True
        except smtplib.SMTPException as e:
            traceback['error'] = '%s (%s)' % (e.message, type(e))
            traceback['status'] = False
        full_traceback.append(traceback)
    errors_to_return = []
    error_not_found = []
    for email in full_traceback:
        if email['status']:
            error_not_found.append(True)
        else:
            error_not_found.append(False)
            errors_to_return.append(email['error'])
    if False in error_not_found:
        error_not_found = False
    else:
        error_not_found = True
    return (error_not_found, errors_to_return)
# really naive view using the function above
def my_email_view(request, user_id):
    user = get_object_or_404(User, pk=user_id)
    subject = 'Subject'
    from_email = 'myemail@xxx.com'
    # wrap the single user in a list, since send_email_to_users expects an iterable
    email_sent, traceback = send_email_to_users([user], subject, from_email)
    if email_sent:
        return render(request, 'sucess_template.html')
    return render(request, 'fail_template.html', {'email_errors': traceback})
In your template mail_template.html:
<h1>{{title}}</h1>
<p>Dear {{username}},</p>
<p>{{content}}</p>
And don't forget to configure the email settings in settings.py: https://docs.djangoproject.com/fr/2.2/ref/settings/#email-backend
send_mail from the docs: https://docs.djangoproject.com/fr/2.2/topics/email/#send-mail
render_to_string from the docs: https://docs.djangoproject.com/fr/2.2/topics/templates/#django.template.loader.render_to_string
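For reference, a minimal SMTP configuration in settings.py might look like this (the host, port and credentials below are placeholders, not values from the answer):

# settings.py
EMAIL_BACKEND = 'django.core.mail.backends.smtp.EmailBackend'
EMAIL_HOST = 'smtp.example.com'      # placeholder SMTP server
EMAIL_PORT = 587
EMAIL_USE_TLS = True
EMAIL_HOST_USER = 'myemail@xxx.com'  # placeholder credentials
EMAIL_HOST_PASSWORD = 'my-password'
DEFAULT_FROM_EMAIL = EMAIL_HOST_USER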

Find element text using xpath in selenium-python NOt Working

The HTML looks like this (screenshot omitted). I have written this code to scrape all courses from a URL. For this I am trying to get the count of courses using XPath, but it does not give me anything. Where am I going wrong?
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class FinalProject:
    def __init__(self, url="https://www.class-central.com/subject/data-science"):
        self.url = url
        base_url = 'https://www.class-central.com'
        self.error_flag = False
        self.driver = webdriver.Chrome(<path to chromedriver>)
        self.driver.get(self.url)
        sleep(2)
        self.count_course_and_scroll()

    def count_course_and_scroll(self):
        wait = WebDriverWait(self.driver, 30)
        ele = wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'Not right now, thanks.')))
        ele.click()
        print "----------------------POP UP CLOSED---------------------"
        total_courses = self.driver.find_element_by_xpath("//span[@id='number-of-courses']")
        print total_courses
        print total_courses.text
        self.driver.close()

fp = FinalProject()
If .text doesn't work, you can try get_attribute:
print total_courses.get_attribute('text')
#or
print total_courses.get_attribute('innerHTML')
ele = wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'Not right now, thanks.')))
ele.click()
print "----------------------POP UP CLOSED---------------------"
total_courses = self.driver.find_element_by_xpath("//span[@id='number-of-courses']")
In that piece of code, I am suspicious of two things:
Does the popup always appear?
Does the text of number-of-courses show up in time?
If you are not sure about the first, I would recommend wrapping it in a try block.
For the second, wait until some text appears on that element:
try:
    ele = wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, 'Not right now, thanks.')))
    ele.click()
finally:
    total_courses = wait.until(EC.presence_of_element_located((By.XPATH, "//span[@id='number-of-courses' and text() != '']")))
    print total_courses.text

How to scrape pages after login

I'm trying to find a way to scrape and parse more pages in the signed-in area.
These example links, accessible only when signed in, are what I want to parse:
#http://example.com/seller/demand/?id=305554
#http://example.com/seller/demand/?id=305553
#http://example.com/seller/demand/?id=305552
#....
I want to create a spider that can open each one of these links and then parse them.
I have created another spider which can open and parse only one of them.
When I tried to use a "for" or "while" loop to issue more requests with the other links, it didn't work because I cannot put more than one return into a generator; it raises an error. I also tried link extractors, but that didn't work for me either.
Here is my code:
#!c:/server/www/scrapy
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from scrapy.http.request import Request
from scrapy.spiders import CrawlSpider, Rule
from array import *
from stack.items import StackItem
from scrapy.linkextractors import LinkExtractor

class Spider3(Spider):
    name = "Spider3"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/login"]  # this link leads to the login page
When I am signed in, the site returns a page whose URL contains "stat"; that is why I put the first "if" condition here.
Once signed in, I request one link and call the parse_items function.
def parse(self, response):
    # when "stat" is in the url it means that I just signed in
    if "stat" in response.url:
        return Request("http://example.com/seller/demand/?id=305554", callback=self.parse_items)
    else:
        # this successful login turns me to a page whose url contains "stat"
        return [FormRequest.from_response(response,
            formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login', 'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
            callback=self.parse)]
The parse_items function simply parses the desired content from one desired page:
def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
Can you please help me update this code so it opens and parses more than one page in each session?
I don't want to sign in over and over for each request.
The session most likely depends on cookies, and Scrapy manages those by itself, i.e.:

def parse_items(self, response):
    questions = Selector(response).xpath('//*[@id="ctl00_ContentPlaceHolder1_cRequest_divAll"]/table/tr')
    for question in questions:
        item = StackItem()
        item['name'] = question.xpath('th/text()').extract()[0]
        item['value'] = question.xpath('td/text()').extract()[0]
        yield item
    next_url = ''  # find the url of the next page in the current page
    if next_url:
        # scrapy will retain the session for the next page if it's managed by cookies
        yield Request(next_url, self.parse_items)
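Since the question lists the id URLs up front, here is a short sketch of how the asker's parse method could be adapted (my own adaptation, not part of the answer above): yield one request per page instead of returning a single one, since yield is what allows several requests from one callback.

def parse(self, response):
    # when "stat" is in the url, the login already succeeded
    if "stat" in response.url:
        # yield (not return) one request per id; Scrapy keeps the session cookies
        for page_id in [305554, 305553, 305552]:
            yield Request("http://example.com/seller/demand/?id=%d" % page_id,
                          callback=self.parse_items)
    else:
        yield FormRequest.from_response(response,
            formdata={'ctl00$ContentPlaceHolder1$lMain$tbLogin': 'my_login',
                      'ctl00$ContentPlaceHolder1$lMain$tbPass': 'my_password'},
            callback=self.parse)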
I am currently working on the same problem. I use InitSpider so I can override __init__ and init_request. The first is just for initialisation of custom stuff, and the actual magic happens in my init_request:
def init_request(self):
    """This function is called before crawling starts."""
    # Do not start a request on error,
    # simply return nothing and quit scrapy
    if self.abort:
        return
    # Do a login
    if self.login_required:
        # Start with the login first
        return Request(url=self.login_page, callback=self.login)
    else:
        # Start with the parse function
        return Request(url=self.base_url, callback=self.parse)
My login looks like this:

def login(self, response):
    """Generate a login request."""
    self.log('Login called')
    return FormRequest.from_response(
        response,
        formdata=self.login_data,
        method=self.login_method,
        callback=self.check_login_response
    )
self.login_data is a dict with the POST values.
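check_login_response isn't shown in the answer; a minimal sketch of it and of login_data (the field names and the "Logout" marker are assumptions on my part, not taken from the crawlpy project) could look like this:

# assumed form field names for the login POST
login_data = {'username': 'my_login', 'password': 'my_password'}

def check_login_response(self, response):
    """Verify the login worked before starting the real crawl."""
    if 'Logout' in response.body:  # assumed marker of a signed-in page
        self.log('Login successful')
        return Request(url=self.base_url, callback=self.parse)
    self.log('Login failed, aborting')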
I am still a beginner with python and scrapy, so I might be doing it the wrong way. Anyway, so far I have produced a working version that can be viewed on github.
HTH:
https://github.com/cytopia/crawlpy

Test that a URL matches the correct view

For these two Django URL patterns
(r'^articles/(\d{4})/$', 'news.views.year_archive'),
(r'^articles/2003/$', 'news.views.special_case_2003'),
the special_case_2003 view will never be called because of the wider pattern above it.
How can I test (in tests.py) which view has been matched via a URL pattern, to be sure my URLs map to the desired views?
This won't let you match the raw regular expression, but it'll let you match an example of a pattern:

from django.core.urlresolvers import resolve

def test_foo(self):
    func = resolve('/foo/').func
    func_name = '{}.{}'.format(func.__module__, func.__name__)
    self.assertEquals('your.module.view_name', func_name)
You should put the special case first:
(r'^articles/2003/$', 'news.views.special_case_2003'),
(r'^articles/(\d{4})/$', 'news.views.year_archive'),
The URLs are evaluated from top to bottom, and the first view whose pattern matches is rendered. You can test these URLs by opening them in your browser, or you can write a specific test for them in tests.py.
For more information on how to test urls.py, read https://docs.djangoproject.com/en/1.4/topics/testing/#testing-tools, which explains both how to check that you get a 200 response and how to test whether certain content is present.
Here is the canonical example:
>>> from django.test.client import Client
>>> c = Client()
>>> response = c.post('/login/', {'username': 'john', 'password': 'smith'})
>>> response.status_code
200
>>> response = c.get('/customer/details/')
>>> response.content
'<!DOCTYPE html...'
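Put into tests.py, a short sketch combining both ideas (the URL '/articles/2003/' comes from the question; the test class name and the expectation of a 200 response are only illustrative) might look like this:

from django.test import TestCase
from django.core.urlresolvers import resolve

class UrlMatchingTests(TestCase):
    def test_special_case_2003_is_matched(self):
        # the URL should resolve to the intended view function
        match = resolve('/articles/2003/')
        self.assertEqual(match.func.__name__, 'special_case_2003')

    def test_special_case_2003_responds(self):
        # the view should actually answer the request
        response = self.client.get('/articles/2003/')
        self.assertEqual(response.status_code, 200)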