This is my first question on Stack Overflow and I know it won't be my last. I am wondering: what do server-side languages actually do? Do they generate an HTML file based on whatever your code says?
Server-side languages run on the server to generate an HTML page that is then sent to the client (the browser) over HTTP or HTTPS. For example, in ASP.NET each page has a "CodeBehind" attribute that points to a file of server-side code (such as C#); when the page is requested, that code runs, modifies the HTML, and the rendered, finalized page is sent to the client.
Server-side languages are useful when you want to display dynamic web pages, where the content changes with, for example, data saved in your database. PHP is one such language.
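As a rough illustration (a minimal sketch in Python rather than PHP or C#, with made-up data), the essence of any server-side language is turning data, often pulled from a database, into HTML text before it is sent to the browser:

def render_page(products):
    # Build an HTML list from the data; in a real app `products`
    # would come from a database query.
    rows = "".join(f"<li>{name}: ${price:.2f}</li>" for name, price in products)
    return f"<html><body><h1>Products</h1><ul>{rows}</ul></body></html>"

print(render_page([("Coffee", 4.50), ("Tea", 3.00)]))

The browser only ever sees the finished HTML string, never the code that produced it.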
The recent Windows 10 update for KB5003637 seems to have caused our use of the WebBrowser control to fail. Our applications use a C++ dialog that hosts a web browser control based on the IWebBrowser2 interface and implemented by the COM class 8856f961-340a-11d0-a96b-00c04fd705a2. The control interacts with a bespoke internal 'web server' that is hosted on a localhost port. The web browser is rendering dynamic HTML with a bunch of css and javascript. It's a legacy app that has been working reliably for many years.
Our users with Windows 10 versions 2004, 20H2, and 21H1 are installing KB5003637, and when they do the web browser no longer renders the content that it did before.
Looking at a trace, I can see that the web browser control is requesting the page's HTML, which seems to be delivered as it should. What normally happens next is that the control requests the css and javascript files needed to make the page active. What happens instead is nothing.
The KB5003637 update is pretty big, but it does contain fixes for some scripting vulnerabilities described in CVE-2021-31959, which are very much on point. Nothing that I've found so far indicates how this was fixed, what effect it has on the WebBrowser control, or what workarounds there might be.
Any help would be appreciated.
Turns out that the Windows update I described did change the behavior of the WebBrowser control. Our bespoke web server was not including Content-Type headers in its responses to the WebBrowser's requests. For the last decade or more, the control either successfully figured out what the content was or defaulted to the correct content type in the cases that mattered. After the update, the WebBrowser was defaulting to a content type of 'text' for the initial HTML payload. As a result it did not interpret the payload as HTML and therefore took no further action (like requesting the css and js files).
When I changed the code to include a Content-Type header of "text/html" for the initial payload, the application began working. Content-Type headers are now included with all replies.
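For anyone hitting the same thing, the fix is just an HTTP detail. A minimal sketch of a localhost handler that sends the header explicitly (shown in Python rather than our C++ server; the port, css file name, and body are placeholders):

from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Placeholder payload standing in for the real dynamic HTML.
        body = b"<html><head><link rel='stylesheet' href='app.css'></head><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")   # the header that was missing
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()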
First, sorry if this question has been asked before, but I'm a novice, so even if it has, I'm not sure what terms I'd use to search for it.
I'm beginning to learn about REST APIs and it got me thinking: is it possible to load the JSON response directly from the API server into the user's browser and bypass your own server?
Imagine you have, say, a Django app running on a server that accesses email messages from Outlook.com using the Graph API.
I assume an ordinary flow would go something like:
User request -> your server -> Graph API -> your server -> user browser.
It seems like a waste for it to hit your server that second time before it goes on to be presented to the user's browser.
Is there a way the Django app can render a template and effectively tell the browser "expect some data from X source, and place it in y location in this template"?
You could do that with JavaScript. You'd have to include either a <script> tag in your template, or create and include some static JavaScript files with the code.
I'd recommend learning and using the jQuery JavaScript library, as it makes what you're talking about much easier to implement. Research AJAX requests; those are what you'll need to make requests directly to another server, bypassing your own (see the sketch below).
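On the Django side, the view only has to render a template and hand the browser whatever it needs to call the API itself. A minimal sketch (the view name, template name, and session key are all made up for illustration):

from django.shortcuts import render

def inbox(request):
    context = {
        # Fetched by the browser, not by Django:
        "api_url": "https://graph.microsoft.com/v1.0/me/messages",
        # However you obtain and store the user's token (hypothetical session key):
        "access_token": request.session.get("graph_token", ""),
    }
    return render(request, "inbox.html", context)

inbox.html would then contain a small script that calls fetch() (or jQuery's $.ajax) against api_url with the token and inserts the returned JSON into the page, so the data never makes that second trip through your server.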
I have just started with Python web scraping using Requests. This could be a broad question, so I will try to make it as brief as possible.
I came across situations where sometimes an entire page source can be downloaded with r.content (where r is the response object of requests' get call).
Sometimes part of the data is stored in JSON format, in files that can be accessed by closely observing the GET and POST calls being made.
However, I have even found websites where the entire content is in the DOM, but part of it is neither in the page source nor in JSON files.
I am wondering: how many such places can a website store data in?
(Just the names, I am not looking for how to get there)
For this last type of website, I have observed almost every request call made, but couldn't find where the data is.
So is there any other place besides the two mentioned above? Or are those the only two, indicating that I am not doing my job of observing the request calls properly?
You may answer it in brief bullet points and I can take my study from there.
Thanks in advance.
Let's assume we are talking only about HTML data; a web server could also serve you data in many other formats (JSON, XML, etc.).
Please note that what I have described is a generalisation, and like most generalisations, you could find exceptions that do not fit it.
Broadly, we could divide the type of data displayed (for the end user) into two categories:
Pre render
Post render
Pre render
The entire HTML page is constructed on the server side and sent across to the client. Here, the JS side is concerned with user interaction, not with the structure of the data.
We are slowly moving away from this type of structure, but currently a large majority of web pages still use it.
Web scraping is relatively easy here, as we can programmatically pull the HTML page and not bother about the JavaScript code that accompanies it.
A combination of requests and beautifulsoup should work in almost all cases (assuming that you can identify the general structure of the document), as in the sketch below.
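For example, a minimal sketch for the pre-rendered case (the URL and the CSS selector are hypothetical; inspect the real page to find the right ones):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/articles")       # placeholder URL
soup = BeautifulSoup(resp.content, "html.parser")

# The data is already in the HTML, so a plain selector is enough.
for heading in soup.select("h2.article-title"):            # hypothetical selector
    print(heading.get_text(strip=True))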
Post render
Here the HTML page that is returned from the server is just a "skeleton", with placeholders for the actual data. The data is rendered by the accompanying JS code.
In such cases, if you fetch the source file via, for example, requests, you will get an empty shell with no data in it.
If you inspect the calls made by a browser while rendering (Chrome's network tab, Firefox's inspect tool, or the popular Firebug), you will most likely see ajax requests that bring back the actual data from the server.
Depending on how the requests are made, you could hit that ajax endpoint directly and get the data as JSON.
You could then use the response.json() method to parse it into Python dicts, as in the sketch below.
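A minimal sketch for that case (the endpoint, parameters, and keys are all hypothetical; copy the real ones from the browser's network tab):

import requests

resp = requests.get(
    "https://example.com/api/items",                    # hypothetical ajax endpoint
    params={"page": 1},
    headers={"X-Requested-With": "XMLHttpRequest"},     # some endpoints expect this
)
data = resp.json()    # already structured, no HTML parsing needed
for item in data.get("items", []):                      # hypothetical key
    print(item.get("name"))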
In certain (rare) cases, there will not be an ajax call, but the HTML served from the server will still be a shell. The actual data is part of the file served, but stored within the JS code itself. This could be done for a variety of reasons, for example so that dynamic data can be handed to static JS files, or just to deter simple attempts at scraping the page.
One approach to scraping such pages would be to 'render' the page in a headless browser, which executes the JS code and returns HTML that can then be parsed with parsers like beautifulsoup (see the sketch below).
beautifulsoup can work with many parsers, one of which is html5lib; note, though, that a parser alone does not execute JavaScript, so it will not solve this issue by itself.
You could also look at selenium or mechanize.
Or you could try parsing the JS code yourself and pulling the embedded data out, which might be faster.
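A minimal sketch of the headless-browser route using selenium with headless Chrome (the URL is a placeholder; this assumes selenium and a Chrome/chromedriver install):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")                      # run without a visible window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com/js-rendered-page")      # placeholder URL
html = driver.page_source                               # HTML after the JS has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text() if soup.title else "no title")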
Arriving at a conclusion as to what to use requires careful inspection of how the page is rendered in a browser. Even if you don't see an ajax request, the HTML that is served by the server need not be what the browser displays.
A good way to start is by looking at the bare HTML that is being served, either by downloading the page via curl or requests.get, or simply by rendering it in your browser with JavaScript disabled.
Good luck.
I'm interested in emulating the functionality of a web browser in C++ so that I can create a wrapper for several web sites. Right now, the biggest issues with these sites are that they make heavy use of JavaScript that interacts with the HTML DOM. Thus, the simple solution of using curl to download the page, and something like RapidXML to parse its contents is out.
Next, I considered using something like v8 with curl, and that solves the issue of interpreting the JavaScript on the page nicely. However, it doesn't solve the issue of connecting the HTML DOM methods with the JavaScript; in other words, document.getElementById() would fail in v8.
Next, I considered WebKit, which seems like it's perfectly suited to emulate a web browser--after all, Chromium and Safari both utilize it in their web browsers. However, it's a little too complete. I don't need all of the rendering aspects it includes.
So, I'd be looking for some way to:
Make an SSL connection to a web site
Interpret the JavaScript on that web site in connection with the HTML DOM
Set the values of the username/password <input> fields to my username and password
Simulate clicking the "Submit" button by calling the formSubmit() function, from <input type="button" onClick="formSubmit()">
Handle the HTTP POST form action and the subsequent HTTP 301 and JavaScript redirects (accomplished using window.location)
Repeat 2-5 as needed
Besides what I've already considered, what other options do I have? Ideally, I'd want this to be extremely lightweight, without requiring linking to many libraries.
I'm primarily concerned with developing for Windows 7 64-bit.
Well, this sounds a lot like a brute-force program. Disregarding that, and since you don't seem to need to render any website, I think you should just fetch the page with cURL or something similar, parse it, find the form with a regex, retrieve the form action, then make a request using the method taken from the <form> tag and whichever inputs you want (see the sketch below).
The problem is that there is no general way to know when you've logged in successfully unless you add some kind of per-site check. This is mainly because many sites use sessions rather than plain cookies or HTTP auth, and since you can't read the session state directly, you can't tell when it has changed.
That's the most lightweight solution I can come up with right now.
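Roughly, that flow looks like the following; it's shown in Python with requests purely for brevity (the same sequence maps onto libcurl calls in C++), and every URL, field name, and credential here is made up:

import re
import requests

session = requests.Session()                             # keeps session cookies between requests
login_page = session.get("https://example.com/login").text

# Pull the form action out of the HTML; a real HTML parser is more robust than a regex.
match = re.search(r'<form[^>]+action="([^"]+)"', login_page)
action = match.group(1) if match else "/login"

resp = session.post(
    "https://example.com" + action,
    data={"username": "me", "password": "secret"},       # field names depend on the site
    allow_redirects=True,                                 # follow the post-login redirects
)
print(resp.status_code, resp.url)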
I often click on a file link in IE and a download box just pops up. But what happens behind the scenes? I know that IE always talks to the web server with the HTTP protocol, and HTTP is text based.
So is the IE download achieved with the HTTP protocol? If so, how can an arbitrary file format be downloaded over a text-based protocol?
I am currently trying to make a web app which will direct my customers to download some file. My current design is to implement a web service: the customer will call this web service, and the web service will return the file's download URL. But then I don't know what to do with the URL. Could I just use something like File.Copy to copy the file from the URL to the local disk? Or how should I treat the URL? If there's a better design, please teach me.
Many thanks...
By specifying the right content type (the Content-Type header), you tell the browser what kind of data you are sending. Only the headers of an HTTP message are text; the body is just raw bytes, so binary files can be sent as-is.
In addition, there are special encodings (like Base64) that represent binary content as text, using only a limited set of characters and escaping everything else, for contexts that really do require text.
Beyond that, there is nothing you need to do with the URL. IE will know whether it can or cannot open the file and will show the download box accordingly.
Maybe it's something like this PHP example:
<?php
// We'll be outputting a PDF
header('Content-type: application/pdf');
// It will be called downloaded.pdf
header('Content-Disposition: attachment; filename="downloaded.pdf"');
// The PDF source is in original.pdf
readfile('original.pdf');
?>