Scrapy parse iframe url - python-2.7

I am parsing the links from a website, then trying to parse those linked pages for their iframe src.
According to the DEBUG output, the first links are being crawled correctly, but I am not getting any data in my output file.
Is it also possible to remove everything after the ? in the URL? That part looks like embedded iframe information.
I am running CentOS 6.5 with Python 2.7.5:
scrapy runspider new.py -o videos.csv
import scrapy
from scrapy.http.request import Request

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield Request('http://www.pdga.com' + link,
                          callback=self.parse_page, meta={'link': link})

    def parse_page(self, response):
        for frame in response.xpath("//player").extract():
            yield {
                'link': response.urljoin(frame)
            }
Debug results
DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-gbo-final-round-front-9-sexton-mcbeth-mccray-newhouse> (referer: http://www.pdga.com/videos/)
DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-1-pierce-fajkus-leatherman-c-allen-sexton-leatherman> (referer: http://www.pdga.com/videos/)
DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-gbo-final-round-back-9-sexton-mcbeth-mccray-newhouse> (referer: http://www.pdga.com/videos/)
Expected results
http://www.youtube.com/embed/tYBF-BaqVJ8

Scrapy does not scrape the content of iframes, but you can get them. First get the iframe URL, then issue a request for it and parse the response in a callback:
urls = response.css('iframe::attr(src)').extract()
for url in urls:
    yield scrapy.Request(url....)
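Putting this together with the original spider, a minimal sketch of parse_page might look like the following (the iframe::attr(src) selector comes from the answer above; splitting off the query string with urlsplit is an assumption about what "remove everything after the ?" should mean):

from urlparse import urlsplit, urlunsplit  # Python 2.7; use urllib.parse on Python 3

def parse_page(self, response):
    for src in response.css('iframe::attr(src)').extract():
        url = response.urljoin(src)
        parts = urlsplit(url)
        # keep scheme, host and path only; drop everything after the '?'
        yield {'link': urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))}

With that, an embed URL such as http://www.youtube.com/embed/tYBF-BaqVJ8?some=params would be written to the feed without the trailing parameters.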

Related

Run CrawlerProcess in Scrapy with Splash

I have a Scrapy + Splash spider to crawl data. Now I want to run my Scrapy file from a script, so I use CrawlerProcess. My file looks like this:
import scrapy
from scrapy_splash import SplashRequest
from scrapy.crawler import CrawlerProcess

class ProvinceSpider(scrapy.Spider):
    name = 'province'

    def start_requests(self):
        url = "https://e.vnexpress.net/covid-19/vaccine"
        yield SplashRequest(url=url, callback=self.parse)

    def parse(self, response):
        provinces = response.xpath("//div[@id='total_vaccine_province']/ul[@data-weight]")
        for province in provinces:
            yield {
                'province_name': province.xpath(".//li[1]/text()").get(),
                'province_population': province.xpath(".//li[2]/text()").get(),
                'province_expected_distribution': province.xpath(".//li[3]/text()").get(),
                'province_actual_distribution': province.xpath(".//li[4]/text()").get(),
                'province_distribution_percentage': province.xpath(".//li[5]/div/div/span/text()").get(),
            }

process = CrawlerProcess(settings={
    "FEEDS": {
        "province.json": {"format": "json"},
    },
})
process.crawl(ProvinceSpider)
process.start()  # the script will block here until the crawling is finished
But when I run
python3 province.py
it doesn't connect to the Splash server and thus can't crawl any data. Any idea which part I got wrong? Thanks in advance.
It turns out the issue you experienced has already been covered by the following answer here: Answer
A quick breakdown (if you're not interested in the details):
Go to settings.py and add a USER_AGENT; in my case I left it as:
USER_AGENT = 'testit (http://www.yourdomain.com)'
Then run your crawler and it should work. Why? Your Scrapy crawler is being blocked by the site.
Output:
2021-12-26 13:15:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://e.vnexpress.net/covid-19/vaccine>
{'province_name': 'HCMC', 'province_population': '7.2M', 'province_expected_distribution': '13.8M', 'province_actual_distribution': '14.6M', 'province_distribution_percentage': '100%'}
2021-12-26 13:15:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://e.vnexpress.net/covid-19/vaccine>
{'province_name': 'Hanoi', 'province_population': '6.2M', 'province_expected_distribution': '11.4M', 'province_actual_distribution': '12.3M', 'province_distribution_percentage': '99,2%'}
2021-12-26 13:15:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://e.vnexpress.net/covid-19/vaccine>
{'province_name': 'Dong Nai', 'province_population': '2.4M', 'province_expected_distribution': '4.3M', 'province_actual_distribution': '5M', 'province_distribution_percentage': '100%'}
...
...
Here's my custom settings:
BOT_NAME = 'testing'
SPIDER_MODULES = ['testing.spiders']
NEWSPIDER_MODULE = 'testing.spiders'
SPLASH_URL = 'http://localhost:8050'
USER_AGENT = 'testing (http://www.yourdomain.com)'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.2 Safari/605.1.15'
}
SPIDER_MIDDLEWARES = {
    'testing.middlewares.TestingSpiderMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
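One more thing worth checking: CrawlerProcess(settings={...}) only uses the inline dict you pass it, so the Splash middlewares, SPLASH_URL, and USER_AGENT defined in settings.py are not applied unless the project settings are loaded explicitly. A minimal sketch of the script's bottom half, assuming the script is run from inside the Scrapy project so get_project_settings() can find settings.py:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()  # loads settings.py: SPLASH_URL, middlewares, USER_AGENT
settings.set("FEEDS", {"province.json": {"format": "json"}})

process = CrawlerProcess(settings=settings)
process.crawl(ProvinceSpider)
process.start()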

scrapy shell enable javascript

I am trying to get the response.body of https://www.wickedfire.com/ in scrapy shell,
but the response.body tells me:
<html><title>You are being redirected...</title>\n<noscript>Javascript is required. Please enable javascript before you are allowed to see this page...
How do I activate the JavaScript? Or is there something else that I can do?
Thank you in advance.
UPDATE:
I've installed scrapy-splash with pip install scrapy-splash
and put these settings in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050/'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
It did give me an error:
NameError: Module 'scrapy_splash' doesn't define any object named 'SplashCoockiesMiddleware'
I commented that line out after the error, and it passed.
My script looks like this, but it doesn't work:
...
from scrapy_splash import SplashRequest
...
    start_urls = ['https://www.wickedfire.com/login.php?do=login']
    payload = {'vb_login_username': '', 'vb_login_password': ''}

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 1})

    def parse(self, response):
        # url = "https://www.wickedfire.com/login.php?do=login"
        r = SplashFormRequest(response, formdata=payload, callback=self.after_login)
        return r

    def after_login(self, response):
        print response.body + "THIS IS THE BODY"
        if "incorrect" in response.body:
            self.logger.error("Login failed")
            return
        else:
            results = FormRequest.from_response(response,
                                                formdata={'query': 'bitter'},
                                                callback=self.parse_page)
            return results
...
This is the error that I get:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://wickedfire.com/ via http://localhost:8050/render.html> (failed 1 times): 502 Bad Gateway
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://wickedfire.com/ via http://localhost:8050/render.html> (failed 2 times): 502 Bad Gateway
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://wickedfire.com/ via http://localhost:8050/render.html> (failed 3 times): 502 Bad Gateway
[scrapy.core.engine] DEBUG: Crawled (502) <GET https://wickedfire.com/ via http://localhost:8050/render.html> (referer: None) ['partial']
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <502 https://wickedfire.com/>: HTTP status code is not handled or not allowed
I also tried scrapy-splash with scrapy shell using this Guide.
I just want to log in to the page, put in a keyword to be searched, and get the results. That is my end goal.
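For the scrapy shell part specifically, one commonly used workaround (a sketch, assuming a Splash instance is already running locally on port 8050 as configured above) is to point the shell at Splash's render.html endpoint instead of the site itself, so the JavaScript is executed before the HTML reaches the shell:

scrapy shell 'http://localhost:8050/render.html?url=https://www.wickedfire.com/&wait=2'

The 502 Bad Gateway responses in the log above, though, suggest Splash itself could not fetch the page, so it is worth confirming the Splash container is actually running and reachable at localhost:8050 before debugging the spider further.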

scrapy "Missing scheme in request url"

Here's my code below-
import scrapy
from scrapy.http import Request

class lyricsFetch(scrapy.Spider):
    name = "lyricsFetch"
    allowed_domains = ["metrolyrics.com"]

    print "\nEnter the name of the ARTIST of the song for which you want the lyrics for. Minimise the spelling mistakes, if possible."
    artist_name = raw_input('>')
    print "\nNow comes the main part. Enter the NAME of the song itself now. Again, try not to have any spelling mistakes."
    song_name = raw_input('>')

    artist_name = artist_name.replace(" ", "_")
    song_name = song_name.replace(" ", "_")
    first_letter = artist_name[0]
    print artist_name
    print song_name

    start_urls = ["www.lyricsmode.com/lyrics/"+first_letter+"/"+artist_name+"/"+song_name+".html"]
    print "\nParsing this link\t " + str(start_urls)

    def start_requests(self):
        yield Request("www.lyricsmode.com/feed.xml")

    def parse(self, response):
        lyrics = response.xpath('//p[@id="lyrics_text"]/text()').extract()
        with open("lyrics.txt", 'wb') as lyr:
            lyr.write(str(lyrics))
        #yield lyrics
        print lyrics
I get the correct output when I use the scrapy shell; however, whenever I try to run the script using scrapy crawl, I get the ValueError. What am I doing wrong? I went through this site, and others, and came up with nothing. I got the idea of yielding a request from another question over here, but it still didn't work.
Any help?
My traceback-
Enter the name of the ARTIST of the song for which you want the lyrics for. Minimise the spelling mistakes, if possible.
>bullet for my valentine
Now comes the main part. Enter the NAME of the song itself now. Again, try not to have any spelling mistakes.
>your betrayal
bullet_for_my_valentine
your_betrayal
Parsing this link ['www.lyricsmode.com/lyrics/b/bullet_for_my_valentine/your_betrayal.html']
2016-01-24 19:58:25 [scrapy] INFO: Scrapy 1.0.3 started (bot: lyricsFetch)
2016-01-24 19:58:25 [scrapy] INFO: Optional features available: ssl, http11
2016-01-24 19:58:25 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'lyricsFetch.spiders', 'SPIDER_MODULES': ['lyricsFetch.spiders'], 'BOT_NAME': 'lyricsFetch'}
2016-01-24 19:58:27 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-01-24 19:58:28 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-01-24 19:58:28 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-01-24 19:58:28 [scrapy] INFO: Enabled item pipelines:
2016-01-24 19:58:28 [scrapy] INFO: Spider opened
2016-01-24 19:58:28 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 19:58:28 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 19:58:28 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "C:\Users\Nishank\Miniconda2\lib\site-packages\scrapy\core\engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "C:\Users\Nishank\Desktop\SNU\Python\lyricsFetch\lyricsFetch\spiders\lyricsFetch.py", line 26, in start_requests
yield Request("www.lyricsmode.com/feed.xml")
File "C:\Users\Nishank\Miniconda2\lib\site-packages\scrapy\http\request\__init__.py", line 24, in __init__
self._set_url(url)
File "C:\Users\Nishank\Miniconda2\lib\site-packages\scrapy\http\request\__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: www.lyricsmode.com/feed.xml
2016-01-24 19:58:28 [scrapy] INFO: Closing spider (finished)
2016-01-24 19:58:28 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 24, 14, 28, 28, 231000),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 1, 24, 14, 28, 28, 215000)}
2016-01-24 19:58:28 [scrapy] INFO: Spider closed (finished)
As @tintin said, you are missing the http scheme in the URLs. Scrapy needs fully qualified URLs in order to process the requests.
As far as I can see, you are missing the scheme in:
start_urls = ["www.lyricsmode.com/lyrics/ ...
and
yield Request("www.lyricsmode.com/feed.xml")
In case you are parsing URLs from the HTML content, you should use urljoin to ensure you get a fully qualified URL, for example:
next_url = response.urljoin(href)
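For the two places called out above, a minimal sketch of the fix is simply to add the scheme (assuming the site is served over plain http):

start_urls = ["http://www.lyricsmode.com/lyrics/" + first_letter + "/" + artist_name + "/" + song_name + ".html"]

def start_requests(self):
    yield Request("http://www.lyricsmode.com/feed.xml")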
I also encountered this problem today. A URL usually has a scheme, most commonly http or https.
Most likely the URLs you extract from the start_url response come without a scheme, such as //list.jd.com/list.html.
You should add the scheme to the URL, so it becomes https://list.jd.com/list.html.
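Tying the two answers together, a small sketch (the scheme-relative URL is taken from the answer above; the base page being https is an assumption): response.urljoin already handles scheme-relative links, so the urljoin advice from the first answer covers this case too:

# inside a callback whose response was fetched from an https:// page
next_url = response.urljoin('//list.jd.com/list.html')
# -> 'https://list.jd.com/list.html' (urljoin fills in the scheme from the current response)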

scrapy:how to skip the urls which don't response?

Scrapy crawls some pages, but every time the crawler stops when it reaches one of them, and I found the reason: some URLs don't respond. Opening those URLs in a browser shows nothing, just a blank page, not a 404, and this causes the crawler to stop. What should I do? English is not my mother language, so I am not sure I have described it clearly.
Some more description:
The crawler is written with Scrapy + Redis + MongoDB. There are about 70 list pages, and every list page links to 10 detail pages, so the correct total is about 70*10 = 700 pages. But at around 400 pages the crawler can't get any more; the log says:
2015-10-14 22:28:13 [scrapy] INFO: Crawled 1192 pages (at 76 pages/min), scraped 443 items (at 35 items/min)
2015-10-14 22:29:13 [scrapy] INFO: Crawled 1192 pages (at 0 pages/min), scraped 443 items (at 0 items/min)
("Crawled 1192 pages": there are also some AJAX requests and list-page requests, so the page count is more than 700.)
I found that the reason the crawler stops is that some URLs don't respond; opening them in a browser shows nothing, just a blank page, not a 404. I want to ignore these non-responding pages and continue crawling the next ones.
@Shekhar Samanta said:
"use:
try:
    your line to make http requests
except:
    pass
with the help of this your crawler will not break."
This is the code that makes the HTTP requests in my crawler; I don't know where to add the try/except:
def parse(self, response):
    url_list = response.xpath('//div[@class="title"]/a/@href')
    for url in url_list:
        fullurl = response.urljoin(url.extract())
        yield Request(fullurl, callback=self.parseContent)

def parseContent(self, response):
    for sel in response.xpath('//div[@class="content"]'):
        item = ArticlespiderItem()
        item['articleUrl'] = response.url
        item['aticleTitle'] = sel.xpath('div[1]/div[1]/h3/text()').extract()
        yield item
Use:
try:
    your line to make http requests
except:
    pass
With the help of this, your crawler will not break.
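A try/except cannot easily be wrapped around yield Request(...), because Scrapy performs the download asynchronously after the callback returns. An alternative sketch that fits this crawler (not from the answer above, and the timeout value is an assumption): attach an errback to each Request and cap the download timeout, so URLs that never respond are logged and skipped while the rest of the crawl continues.

from scrapy import Request

# settings.py: give up on a URL after 30 seconds (value is an assumption)
DOWNLOAD_TIMEOUT = 30

def parse(self, response):
    url_list = response.xpath('//div[@class="title"]/a/@href')
    for url in url_list:
        fullurl = response.urljoin(url.extract())
        # errback is called on timeouts and connection errors instead of stopping the spider
        yield Request(fullurl, callback=self.parseContent, errback=self.on_error)

def on_error(self, failure):
    # log the dead URL and move on; the other queued requests keep running
    self.logger.warning("Skipping unresponsive URL: %s", failure.request.url)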

Scrapy removes query strings in response

In scrapy shell, when I tried using fetch on a Google search results page:
$ scrapy shell "http://www.google.com/?gws_rd=ssl#q=%22german+beer+near%22&start=0"
I got a response without the query string after the '#':
[s] request <GET http://www.google.com/?gws_rd=ssl#q=jeffrey+m+liebmann>
[s] response <200 http://www.google.com/?gws_rd=ssl>
Is this an issue with Scrapy or with Google? I tried pasting the whole URL + query string into a browser and Google led me to the results just fine.
Google switched from http to https a while ago, so you can simply drop gws_rd=ssl from your request URL. More importantly, everything after the '#' is a URL fragment, which is never sent to the server, so put the query in a real query string on the /search endpoint. Try this:
scrapy shell "https://www.google.fr/search?q=%22german+beer+near%22&start=0"
and the response will be:
[s] request <GET https://www.google.fr/search?q=%22german+beer+near%22&start=0>
[s] response <200 https://www.google.fr/search?q=%22german+beer+near%22&start=0>