content empty when using scrapy

content empty when using scrapy - python-2.7

Thanks for everyone in advance.
I encountered a problem when using Scrapy on Python 2.7.
The webpage I tried to crawl is a discussion board for Chinese stock market.
When I tried to get the first number "42177" just under the banner of this page (the number you see on that webpage may not be the number you see in the picture shown here, because it represents the number of times this article has been read and is updated realtime...), I always get an empty content. I am aware that this might be the dynamic content issue, but yet don't have a clue how to crawl it properly.
The code I used is:
item["read"] = info.xpath("div[#id='zwmbti']/div[#id='zwmbtilr']/span[#class='tc1']/text()").extract()
I think the xpath is set correctly and I have checked the return value of this response and it indeed told me that there is nothing under this directory. Results shown here:'read': [u'<div id="zwmbtilr"></div>']
If it has something, there should be something between <div id="zwmbtilr"> and </div>.
Really appreciated if you guys share any thoughts on this!

I just opened your link in Firefox with NoScript enabled. There nothing inside the <div #id='zwmbtilr'></div>. If I enable the javascripts, I can see the content you want. So, as you already new, it is a dynamic content issue.
Your first option is try to identify the request generated by javascript. If you can do that, you can send the same request from scrapy. If you can't do it, the next option is usually to use some package with javascript/browser emulation or someting like that. Something like ScrapyJS or Scrapy + Selenium.

Related

How to link to html page at specific ID bookmark?

Using Diagrams.net (draw.io), I would like to link specific elements to web pages. This is easily accomplished currently by creating a link for the element (say a rectangle).
However, I would like to navigate directly to a specific id bookmark in the HTML page. I cannot seem to get that to work.
For example, if I try to use this syntax (which works in the browser location bar):
https://en.wikipedia.org/wiki/Canada#Geography
I will be taken to the main page:
https://en.wikipedia.org/wiki/Canada
However, the goal is to go to the "Geography" section of this page.
I have also tried the json syntax without any success:
data:action/json,{"actions":[{"open":"https://en.wikipedia.org/wiki/Canada#Geography"}]}
I have also played with different action syntax such as:
data:action/json,{"actions":[{"open":"https://en.wikipedia.org/wiki/Canada"},{"scroll":{"tags":["Geography"]}}]}
Note: I'm using the diagrams.net desktop version 14.1.8.
Thank you for taking the time to read this question.
Paul

On Windows this only seems to work if the browser isn't already open. There is not much we can do to fix this as we're passing the link to the OS.

How to make a web page to totally display when loaded in python selenium?

The main objective I have is to read a table in a webpage and account for the total element it has. But since you have to scroll down to find another elements not 'chased' by this sentece
table_css=driver.find_elements_by_id('DeletButtn')
then I decided to zoom in down to 30% to catch them all. But no way I've used is working when trying to zoom in:
driver.execute_script("document.body.style_zoom='30%'")
driver.execute_script("document.body.style.zoom='30%'")
driver.execute_script("document.body.style.zoom='0.3'")
do not work
I have also tried to use key and send keys, but they have been worthless too:
html = driver.find_element_by_tag_name("body")
html.send_keys(Keys.CONTROL, Keys.SUBTRACT)
driver.Keys(html, "ctrl").Keys(html, "-").perform()
action = ActionChains(driver)
action.key_down(Keys.CONTROL).key_down(Keys.SUBTRACT).perform()
None of these combinations have worked up til now. They haven't show me any error so far, so I do not know what I am doing wrong.
I am using python 2.7, geckodriver and Firefox
Thanks for your help and greetings!

Server Side Includes are not being shown

I've gone through every fix online I'm pretty sure but I just can't seem to get this to work. I'm trying to have <#include virtual="footer.ssi" but not matter what it doesn't work.
footer.ssi and the page are in the same place
I really have no clue how to fix this issue but I can't get ANY server sides to show up on any of my pages so I know it's on my end. If anyone has any thoughts they're greatly appreciated!

I believe your code should be this: <!--#include virtual="footer.ssi" -->, this is what I use on my website, and it works perfectly!
See this reference: Html Goodies, and scroll down to The File Argument section.

django - ckeditor bug renders text/string in raw html format

am using django ckeditor. Any text/content entered into its editor renders raw html output on the webpage.
for ex: this is rendered output of ckeditor field (RichTextField) on a webpage;
<p><span style="color:rgb(0, 0, 0)">this is a test file ’s forces durin</span><span style="color:rgb(0, 0, 0)">galla’s good test is one that fails Thereafter, never to fail in real environment. </span></p>
I have been looking for a solution for a long time now but unable to find one :( There are some questions which are similar but none of those have been able to help. It will be helpful if any changes suggested are provided with the exact location where it needs to be changed. Needless to say I am a newbie.
Thanks

You need to mark the relevant variable that contains the html snippet in your template as safe
Obviously you should be sure, that the text comes from trusted users and is safe, because with the safe filter you are disabling a security feature (autoescaping) that Django applies per default.
If your ckeditor is part of a comment form and your mark the entered text as safe, anybody with access to the form could inject Javascipt and other (potentially nasty) stuff in your page.
The whole story is explained pretty well in the official docs: https://docs.djangoproject.com/en/dev/topics/templates/#automatic-html-escaping

Select all frames at once in Selenium

This could be a stupid questions for some. But its truly important for me.
I know how to switch frames using selenium webdriver.
However, is there a way to download all the page_source of the entire page for all the frames at once.
Instead of switching them again and again?
Could someone please let me know the command if it exists?
If not then please say there is none. And that should answer my question.
Thanks in advance

Webdrivers' getPageSource will return some state in some formatting of the last page the driver was on.
From the (java)docs, but most probably applies to other languages:
getPageSource
java.lang.String getPageSource()
Get the source of the last loaded page. If the page has been modified
after loading (for example, by Javascript) there is no guarantee that
the returned text is that of the modified page. Please consult the
documentation of the particular driver being used to determine whether
the returned text reflects the current state of the page or the text
last sent by the web server. The page source returned is a
representation of the underlying DOM: do not expect it to be formatted
or escaped in the same way as the response sent from the web server.
Think of it as an artist's impression.
Returns:
The source of the current page
http://selenium.googlecode.com/git/docs/api/java/org/openqa/selenium/WebDriver.html#getPageSource%28%29

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

content empty when using scrapy - python-2.7

Related

How to link to html page at specific ID bookmark?

How to make a web page to totally display when loaded in python selenium?

Server Side Includes are not being shown

django - ckeditor bug renders text/string in raw html format

Select all frames at once in Selenium

Categories

Resources