I am having trouble handling a simple HTML link with the IWebBrowser2 interface.
Imagine I navigate to a URL like this:
scheme://host:port/path.html
The file path.html contains HTML/JavaScript code that automatically links to another HTML file, which is located at a URL like this:
scheme://host:port/target.html
The first file loads fine. But once IE handles the link from path.html, the URL that is passed as a parameter to the OnBeforeNavigate2 handler contains a strange value that looks like this:
"about:target.html"
The name of the file that path.html linked to is there, but the rest of the URL is lost.
Does anybody have an idea how this could be resolved?
Thanks in advance.
Related
I am working on an offline web browser that opens content from archive files.
I use the QWebEngineView class to display the content and it works for most of the files I use.
But I have a case where some links have a relative URL that begins with "../", and Qt doesn't seem to interpret this type of path correctly.
For example, on the page whose URL is "question/8/turing-completeness-in-conlangs.html", there is a link like this and it redirects to "question/tag/grammar/1.html".
Is this behaviour normal? Is there anywhere I can modify it?
I already tried to fix this problem in the QWebEngineUrlSchemeHandler::requestStarted method, but the URL of QWebEngineUrlRequestJob *request is already wrong by then.
Thanks to everyone in advance.
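For reference, standard relative-URL resolution (RFC 3986) first drops the last path segment of the base URL and only then applies "../". Here is a minimal Python sketch of that rule. It assumes the link's href is something like "../tag/grammar/1.html" (the actual href is not shown above), and the host is a placeholder, since the question gives only the path:

```python
from urllib.parse import urljoin

# Hypothetical base URL: only the path comes from the question,
# the scheme and host are placeholders.
BASE = "http://example.invalid/question/8/turing-completeness-in-conlangs.html"


def resolve(relative_href: str) -> str:
    """Resolve a relative href against BASE using the standard rules."""
    return urljoin(BASE, relative_href)


if __name__ == "__main__":
    # One "../" climbs from /question/8/ to /question/.
    print(resolve("../tag/grammar/1.html"))
    # Two "../" climb to the root.
    print(resolve("../../tag/grammar/1.html"))
```

Under that assumption, a single "../" would be expected to give "question/tag/grammar/1.html", so the href written into the archived pages may be worth checking as well.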
I encountered a problem when using Scrapy on Python 2.7.
The webpage I tried to crawl is a discussion board for the Chinese stock market.
When I tried to get the first number, "42177", just under the banner of this page (the number you see on that webpage may differ from the one in the picture shown here, because it is the number of times the article has been read and is updated in real time), I always got empty content. I am aware that this might be a dynamic content issue, but I don't yet have a clue how to crawl it properly.
The code I used is:
item["read"] = info.xpath("div[@id='zwmbti']/div[@id='zwmbtilr']/span[@class='tc1']/text()").extract()
I think the XPath is set correctly. I have checked the return value of this response, and it indeed told me that there is nothing under this node. The result is shown here: 'read': [u'<div id="zwmbtilr"></div>']
If it has something, there should be something between <div id="zwmbtilr"> and </div>.
I would really appreciate it if you could share any thoughts on this!
I just opened your link in Firefox with NoScript enabled. There is nothing inside the <div id="zwmbtilr"></div>. If I enable JavaScript, I can see the content you want. So, as you already knew, it is a dynamic content issue.
Your first option is to try to identify the request generated by the JavaScript. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use some package with JavaScript/browser emulation, such as ScrapyJS or Scrapy + Selenium.
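The NoScript check above can also be reproduced offline. This is a minimal sketch, not Scrapy itself. It uses the standard library's ElementTree on two hypothetical, well-formed fragments modelled on the markup in the question, one as served and one as it might look after the script has run. Real pages are rarely well-formed XML, so treat it purely as an illustration of why the XPath returns nothing:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: the structure follows the ids in the question,
# the exact markup of the real page is an assumption.
STATIC_HTML = '<div id="zwmbti"><div id="zwmbtilr"></div></div>'
RENDERED_HTML = (
    '<div id="zwmbti"><div id="zwmbtilr">'
    '<span class="tc1">42177</span></div></div>'
)


def read_counts(markup: str) -> list:
    """Return the text of every span.tc1 inside div#zwmbtilr."""
    root = ET.fromstring(markup.strip())  # root is div#zwmbti
    spans = root.findall("./div[@id='zwmbtilr']/span[@class='tc1']")
    return [span.text for span in spans]


if __name__ == "__main__":
    print(read_counts(STATIC_HTML))    # what a crawler without JavaScript sees
    print(read_counts(RENDERED_HTML))  # what a browser shows after the script ran
```

The same expression matches in the second fragment only because the span exists there, which is why the fix has to be getting the data (the script's request, or a rendered page) rather than changing the XPath.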
I have a Qt program that downloads webpages (HTML), parses them and then generates its own HTML, which is then displayed with QWebPage. Sometimes the HTML that I download contains IMG tags, which work fine when the src attribute contains a full URL. However, sometimes the IMG tag uses a relative path like:
<IMG SRC="images/foo.png" />
Since I know the URL that should be prepended to the SRC my first thought was to just tack it onto my resulting HTML when I'm parsing. However, this is proving more difficult than I anticipated and now I'm wondering if there's a better way.
Is there any mechanism/property in QWebPage that lets me say "use this URL for relative paths"? Or maybe someone can suggest a better way to accomplish what I want?
Thanks!
In the comments, you mentioned that you're using QWebView::setHtml(). The second, optional parameter of this method sets the URL to use for resolving relative paths. According to the documentation:
External objects such as stylesheets or images referenced in the HTML
document are located relative to baseUrl.
Setting that parameter should be all that's needed here.
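To illustrate what resolving against a base URL amounts to, here is a small Python sketch using the standard library rather than Qt. The base URL and the absolute image URL are hypothetical placeholders; only "images/foo.png" comes from the question:

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the HTML was downloaded from.
BASE_URL = "http://example.invalid/articles/page.html"


def resolve_src(src: str, base_url: str = BASE_URL) -> str:
    """Resolve an IMG src the way a base URL is applied to relative paths."""
    return urljoin(base_url, src)


if __name__ == "__main__":
    # A relative path is resolved against the directory of the base URL.
    print(resolve_src("images/foo.png"))
    # A full URL is left untouched.
    print(resolve_src("http://cdn.example.invalid/bar.png"))
```

Passing the base URL to setHtml() lets the web view do this for every relative reference, so the generated HTML does not need to be rewritten by hand.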
I do not understand what parsing HTML actually means.
As I understand it:
- Suppose we have an HTML file. By parsing it we can get at the contents of the file and edit them. Am I right? (Parsing simply gives an idea of the contents and structure inside the file.)
I have one more question:
- Suppose I have the contents of an HTML file stored in a stream (say IStream *HTMLContents; how I got the contents does not matter for now). Is there a way to use these contents to create a preview in a window, dialog box, or preview pane that looks exactly like the file does in a browser? You can imagine that I downloaded the HTML from a web page or anywhere else; what matters is that I am sure the stream holds the contents of an HTML file, and I want to render it in a window, dialog box, or preview pane that I created myself. I know it won't be able to display some of the pictures in the HTML file, but that is not a problem for me. How can I do that? (I am using Visual C++ for this task.)
Parsing basically means analyzing data. When you parse HTML, you might be figuring out where all the various elements are located and what they do.
As for displaying HTML, it depends on what you want to do:
If you want to open the file in your browser, use something like this.
As for displaying HTML directly in your form, I don't really know of any other way than parsing the HTML and creating your own web rendering engine. Good luck and have fun with that I guess.
Parsing HTML means building an object model, such as the DOM (https://en.wikipedia.org/wiki/Document_Object_Model), in your program.
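As a rough illustration of "building an object model", here is a minimal Python sketch that turns markup into a tree of nodes with the standard library's html.parser. The Node and TreeBuilder classes are hypothetical names made up for this example, and the sketch assumes well-nested markup without unclosed void elements such as a bare <br>. A real DOM is far richer:

```python
from html.parser import HTMLParser


class Node:
    """One element in the tree: a tag, its text, and its children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Assumes well-nested markup: closing a tag returns to the parent.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one per element."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    tree = parse("<html><body><h1>Title</h1><p>Hello</p></body></html>")
    print("\n".join(outline(tree)))
```

Once the tree exists, the program can inspect or edit the nodes, which is the "contents and structure" idea from the question.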
I've created a web browser using MFC, and I'm using IHhmlReader to read the contents of the HTML once the user has entered a URL in the browser and the page has completely loaded. Now I want to check whether the web page has any Flash in it.
Any help would be highly appreciated.
Thank You.
I think this is a bit difficult to do, just reading from the HTML source, unless you try to instantiate the page and see if it's making a call to the Flash object. I have listed some options you can try, but you'll need to make sure that the code element is not commented out and check include files and iframes to see if Flash is called from there.
* Look for the OBJECT and EMBED tags (see http://kb2.adobe.com/cps/127/tn_12701.html)
* In the page's JavaScript, look for a SWFObject() call
* Look for a reference to a .swf file (could even be in an img tag)
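The three checks above can be sketched in Python with the standard library's html.parser. This is only an illustration of the heuristic, not MFC code. FlashDetector and find_flash_hints are hypothetical names, and an OBJECT or EMBED tag is only a hint, since those tags can host other content too. The parser reports comments separately, so commented-out markup is not counted. Include files and iframes are not followed:

```python
from html.parser import HTMLParser


class FlashDetector(HTMLParser):
    """Collects hints that a page may embed Flash."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # Check 1: OBJECT and EMBED tags.
        if tag in ("object", "embed"):
            self.hits.append("<%s> tag" % tag)
        # Check 3: any attribute value that references a .swf file.
        for name, value in attrs:
            if value and ".swf" in value.lower():
                self.hits.append("%s=%s" % (name, value))

    def handle_data(self, data):
        # Check 2: a SWFObject call in inline script text.
        if "swfobject" in data.lower():
            self.hits.append("SWFObject call in script")


def find_flash_hints(markup):
    detector = FlashDetector()
    detector.feed(markup)
    detector.close()
    return detector.hits


if __name__ == "__main__":
    page = '<object data="movie.swf"></object><!-- <embed src="old.swf"> -->'
    print(find_flash_hints(page))
```

An empty result does not prove the page is free of Flash, because external scripts can still inject it at run time.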
Good luck...