How to simply parse HTML references - C++

How can I simply parse HTML links? For example, I receive an HTTP response containing HTML, which has links to other files that need to be downloaded, such as JPGs, CSS files, and JS files. What is the simplest way to parse all these references?

Use an HTML parser for your platform/language.
There are some recommendations for c++ ones here.
Once you have a parsed document, you will need to look at each src and href attribute in it. You will also need to take the base tag into account, if one exists, and add logic for external, relative and absolute paths.
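The path logic mentioned above could look roughly like the sketch below. It assumes the src/href values have already been extracted by whatever parser you choose, and that `base` is either the page URL or the href of the base tag. The names `RefKind`, `classify` and `resolve` are made up for this illustration, and the code deliberately ignores "../" segments, query strings and fragments in the base URL.

```cpp
#include <string>

// Kinds of reference that can appear in a src or href attribute.
enum class RefKind { Absolute, ProtocolRelative, RootRelative, Relative };

// Classify an attribute value by its leading characters.
RefKind classify(const std::string& ref) {
    if (ref.rfind("//", 0) == 0) return RefKind::ProtocolRelative;
    if (!ref.empty() && ref[0] == '/') return RefKind::RootRelative;
    // A scheme ends with ':' and must come before any '/', '?' or '#'.
    const std::size_t colon = ref.find(':');
    const std::size_t delim = ref.find_first_of("/?#");
    if (colon != std::string::npos && colon > 0 &&
        (delim == std::string::npos || colon < delim)) {
        return RefKind::Absolute;
    }
    return RefKind::Relative;
}

// Resolve ref against a base such as "http://example.com/dir/page.html".
// Simplified: no "../" normalisation, no query or fragment handling.
std::string resolve(const std::string& base, const std::string& ref) {
    const std::size_t scheme_end = base.find("://");
    if (scheme_end == std::string::npos) return ref;  // unusable base
    const std::size_t host_end = base.find('/', scheme_end + 3);
    const std::string origin = base.substr(0, host_end);  // scheme + host

    switch (classify(ref)) {
        case RefKind::Absolute:
            return ref;
        case RefKind::ProtocolRelative:
            return base.substr(0, scheme_end + 1) + ref;  // "http:" + "//host/..."
        case RefKind::RootRelative:
            return origin + ref;
        case RefKind::Relative:
            if (host_end == std::string::npos) return origin + "/" + ref;
            return base.substr(0, base.rfind('/') + 1) + ref;  // base directory + ref
    }
    return ref;
}
```

For example, resolve("http://example.com/dir/page.html", "images/foo.png") gives "http://example.com/dir/images/foo.png". A real implementation should follow the full URL resolution rules rather than this shortcut.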

Related

Rewrite first folder to GET param for all php files

I am desperately looking for a rule to achieve the following:
Input URL request would be:
http://myserver.com/param/other/folders/and/files.php
It should redirect to
http://myserver.com/other/folders/and/files.php?p=param
similarly the basic index request
http://myserver.com/param/
would redirect to
http://myserver.com/?p=param
All my php files need the parameter, wherever they are. It'd be nice if JS and CSS files would be excluded but I guess it doesn't really matter since the /file.css?p=param would just be ignored and not cause a problem. I have found rules to map a folder to the GET parameter but none of them are working for php files deeper than the index file on the root level. Thanks so much in advance
Replace
http:\/\/([^\/]+)\/(\w+)\/(.*)
with
http:\/\/\1\/\3\?p=\2
example regex page at https://regex101.com/r/sU6lR9/1
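To check the mapping the question asks for outside of a regex tester, here is a small sketch using std::regex_replace from the C++ standard library. The function name `move_first_folder_to_param` is invented for this example. The replacement puts the remaining path first and the first folder last, so the parameter ends up in the query string. It assumes the incoming URL has no query string of its own.

```cpp
#include <regex>
#include <string>

// Move the first path segment of an http URL into a "p" query parameter.
// Group 1: host, group 2: first folder, group 3: rest of the path.
// URLs that do not match the pattern are returned unchanged.
std::string move_first_folder_to_param(const std::string& url) {
    static const std::regex pattern(R"(http://([^/]+)/(\w+)/(.*))");
    return std::regex_replace(url, pattern, "http://$1/$3?p=$2");
}
```

With the two URLs from the question, this yields http://myserver.com/other/folders/and/files.php?p=param and http://myserver.com/?p=param.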

Getting WebPage to use a specific URL to download HTML resources

I have a Qt program that downloads webpages (HTML), parses them and then generates its own HTML, which is then displayed with QWebPage. Sometimes the HTML that I download contains IMG tags, which work fine when the src attribute contains a full URL. However, sometimes the IMG tag uses a relative path like:
<IMG SRC="images/foo.png" />
Since I know the URL that should be prepended to the SRC my first thought was to just tack it onto my resulting HTML when I'm parsing. However, this is proving more difficult than I anticipated and now I'm wondering if there's a better way.
Is there any mechanism/property in QWebPage with which I can say "use this URL for relative paths"? Or maybe someone can suggest a better way to accomplish what I want?
Thanks!
In the comments, you mentioned that you're using QWebView::setHtml(). The second, optional parameter of this method sets the URL to use for resolving relative paths. According to the documentation:
"External objects such as stylesheets or images referenced in the HTML document are located relative to baseUrl."
Setting that parameter should be all that's needed here.

Don't include header on first page (CFPDF + DDX)

I'm using DDX to add headers, footers, and pagination to PDF documents. If possible I would like the header for the first page of each file to be blank, but then to have headers for the remaining pages.
I've looked through the documentation and can't find a way to do this. It seems like a commonly used feature so I'm guessing there must be some way to implement it.
(From the comments)
You might be able to achieve this effect by using multiple <PDF> tags: one for the first page and another for pages 2-N, with a nested <Header> tag.
For example:
<PDF pages="1" src="c:/path/someFile.pdf">
...
</PDF>
<PDF pages="2-last" src="c:/path/someFile.pdf">
<Header...>
</PDF>

Realtime URI-translation of HTML content in C/C++

For the development of a custom reverse proxy (written in C++), I want to do a real-time translation of URIs in HTML content. For example, if I want to access a resource on http://myserver/ using http://my-reverse-proxy/myserver, all absolute and top-level links like http://myserver/somecontent1.ext or /somecontent2.ext need to be modified.
An HTML tag
<img src="/sample.png">
would therefore be translated to
<img src="/myserver/sample.png">
From my point of view there are two approaches:
1) Use regular expressions with capture groups to find all relevant HTML tags and their paths, then do string replacement.
2) Parse the entire HTML content, transform the parse tree, and pretty-print the result back to a valid HTML resource.
And this is what this question is all about: do you have any experience with which solution might be faster, and maybe even more reasonable? Do you know a framework I might use so as not to reinvent the wheel? As this process should later be used for CSS and XML-based resources as well, it should not be an HTML-dependent solution.
Thanks in advance!
Proxy servers generally work by being servers. They handle all HTTP requests, modify the requested URLs, and then pass the modified request on to the server on the other side.
You should stick to this paradigm. It is far easier and more efficient than mucking around with the files themselves. Anything that is being done real-time can be done at the point of the request.
Also, it should probably be asked: why a custom reverse proxy? Such things exist already.
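If rewriting the content does turn out to be necessary, approach 1 from the question could be sketched as below. This is only an illustration under narrow assumptions: attribute values are double-quoted, only src and href are considered, and only root-relative paths (a single leading "/") are prefixed. The function name `prefix_root_relative` and the `mount` parameter are invented here, and `mount` must not contain "$" because it is spliced into the replacement format string. Absolute links such as http://myserver/... would need a second rule.

```cpp
#include <regex>
#include <string>

// Prefix root-relative src/href values with a proxy mount point such as
// "/myserver". Protocol-relative values ("//host/...") are left alone
// thanks to the negative lookahead.
std::string prefix_root_relative(const std::string& html,
                                 const std::string& mount) {
    static const std::regex attr(R"re(\b(src|href)\s*=\s*"/(?!/))re",
                                 std::regex::icase);
    // $1 is the attribute name as written in the input.
    return std::regex_replace(html, attr, "$1=\"" + mount + "/");
}
```

Applied to the tag from the question with mount set to "/myserver", this turns <img src="/sample.png"> into <img src="/myserver/sample.png">. Regular expressions will miss unquoted or single-quoted attributes and URLs built by scripts, which is one reason the request-level approach is easier to get right.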

Does G-WAN support SSI?

Does G-WAN support SSI, or is there another way to merge different HTML data?
I'm not sure it's the best way, but I want to include static HTML data in other HTML files. What do you use if SSI is not working?
SSI can be achieved with G-WAN's dynamic buffers API: simply replace an HTML comment with your partial HTML content.
Use xbuf_frfile() to load your HTML templates.
Use xbuf_repl() to "include" the partial HTML to your host page.
See contact.c from G-WAN samples.
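The idea of replacing a comment with partial content can be illustrated in plain standard C++, independent of G-WAN. The sketch below does not use the xbuf API; `include_partial` and the marker format are made up for the illustration.

```cpp
#include <string>

// Replace every occurrence of a placeholder comment (for example
// "<!--menu-->") in the host page with the partial HTML content.
std::string include_partial(std::string page,
                            const std::string& marker,
                            const std::string& partial) {
    if (marker.empty()) return page;  // nothing sensible to replace
    std::size_t pos = 0;
    while ((pos = page.find(marker, pos)) != std::string::npos) {
        page.replace(pos, marker.size(), partial);
        pos += partial.size();  // skip past the inserted text
    }
    return page;
}
```

In G-WAN itself, the loading and replacing steps would be done with the xbuf functions named above, as shown in the contact.c sample.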