What's the HTTP request for the page source? - C++

I've managed to make a file downloader in C++ (using Winsock). It can download any file that has a simple direct link, like: www.page.com/image.png
I want to make it download all of the images from an entire page, such as all the images in a 4chan thread, but I don't know what I should send in the HTTP request to get the page's source. How can I request the source of a webpage?

You don't send anything special in the HTTP request; it doesn't work the way you're thinking.
An HTTP request asks for a single document, and the server returns that single document.
To download an entire page, you have to parse the downloaded HTML document, extract all the links (relative or absolute) from the HTML source, then issue a separate HTTP request for every image, CSS, JS, etc. referenced from the main document.
This is how tools like wget's --recursive option download entire pages.
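For the extraction step, here's a rough sketch in C++ (a hypothetical helper using a naive regex scan for <img> src attributes; a real HTML parser is more robust than a regex for production use):

#include <regex>
#include <string>
#include <vector>

// Collect the src attribute of every <img> tag in the page source.
std::vector<std::string> extract_image_urls(const std::string& html) {
    std::vector<std::string> urls;
    std::regex img_re("<img[^>]*\\ssrc=[\"']([^\"']+)[\"']", std::regex::icase);
    for (std::sregex_iterator it(html.begin(), html.end(), img_re), end; it != end; ++it)
        urls.push_back((*it)[1].str());  // capture group 1 holds the URL
    return urls;
}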

If the page is located at the root of the http://www.page.com server, you would send a GET request to the www.page.com server asking for the / resource:
GET / HTTP/1.1
Host: www.page.com
Let's say the page was actually located at http://www.page.com/thepage.html. You would send a GET request asking for /thepage.html instead:
GET /thepage.html HTTP/1.1
Host: www.page.com
Either way, you would then have to parse the resulting HTML to get the individual URLs of all the <img> tags that are on the page.
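Putting the request together with the Winsock approach from the question, a minimal sketch of fetching a page's source might look like this (hypothetical host and path, plain HTTP on port 80, no TLS, redirects, or error recovery):

#include <winsock2.h>
#include <ws2tcpip.h>
#include <iostream>
#include <string>
#pragma comment(lib, "ws2_32.lib")

int main() {
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) return 1;

    // Resolve the (hypothetical) host name and connect on port 80.
    addrinfo hints{}, *res = nullptr;
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo("www.page.com", "80", &hints, &res) != 0) return 1;
    SOCKET sock = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (connect(sock, res->ai_addr, (int)res->ai_addrlen) != 0) return 1;
    freeaddrinfo(res);

    // The same request shown above, terminated by a blank line.
    std::string request =
        "GET /thepage.html HTTP/1.1\r\n"
        "Host: www.page.com\r\n"
        "Connection: close\r\n"
        "\r\n";
    send(sock, request.c_str(), (int)request.size(), 0);

    // Read until the server closes the connection (Connection: close).
    std::string response;
    char buf[4096];
    int n;
    while ((n = recv(sock, buf, sizeof(buf), 0)) > 0)
        response.append(buf, n);
    std::cout << response;  // status line, headers, then the HTML source

    closesocket(sock);
    WSACleanup();
    return 0;
}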

Related

Microsoft Graph API - Error: The Content-Range header length does not match the provided number of bytes

I am trying to upload a file to the Shared Documents library of my SharePoint website. The files are of type PDF and HTML. I am running a ColdFusion development environment and using CFHTTP commands to execute HTTP requests. I have been able to push a POST command and a PUT command to the proper endpoints listed at the link below:
Link: https://learn.microsoft.com/en-us/graph/api/driveitem-createuploadsession?view=graph-rest-1.0#best-practices
I do not understand why, but the first section that mentions the HTTP requests for creating an upload session differs from what is used in the example a little further down. For my project, I am using the endpoint:
"/{variables.instance.microsoftGraphAPIURL}/drive/root:/{item-path}:/createUploadSession"
P.S. variables.instance.microsoftGraphAPIURL is a variable holding the Microsoft Graph endpoint for our SharePoint website.
I have had better luck using PUT commands than POST commands for creating an upload session, and I am able to receive an uploadUrl, but the issue comes when trying to upload the file itself. I am trying to upload a file from the same directory with a file size of 114992 bytes, and I keep getting "The Content-Range header length does not match the provided number of bytes." whenever I run my PUT command to upload the file.
My Content-Range is "bytes 0-114991/114992" and my Content-Length is "114992". For the image below, I replaced the file with a PDF, but the original file was an HTML page of 114992 bytes. I want to use a resumable upload session so I can have one function for uploading image, HTML, and PDF files.
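Put together, the upload request I am attempting looks roughly like this (uploadUrl placeholder; note the range is zero-based and inclusive, so its last index is the file size minus one):

PUT {uploadUrl} HTTP/1.1
Content-Length: 114992
Content-Range: bytes 0-114991/114992

<114992 bytes of file content>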
If anyone could tell me whether there is an issue with my content headers, my upload-session HTTP request, or anything else that is causing this, that would be amazing! Thank you.

Can the ZOHO Deluge script getUrl() function read HTTP response headers?

When trying to use getUrl() to grab a CSV file from a URL behind basic .htaccess authorization, I am redirected to an Amazon S3 location. The getUrl() function passes the original HTTP headers (for the auth) on to Amazon S3, which interprets them as an Amazon token; this causes the following error in the response:
Only one auth mechanism allowed; only the X-Amz-Algorithm query parameter, Signature query string parameter or the Authorization header should be specified
I can't see these issues discussed anywhere other than in an advisory from Thomson Reuters: https://community.developers.thomsonreuters.com/questions/29247/aws-download-x-direct-download-returns-invalid-arg.html
The fix is to receive the redirect back from the remote server, look at the response, pull out the new (redirected) URL, and grab the CSV file from there without the auth details in the header.
Is there a way in ZOHO Deluge script to do this? The getUrl() function seems really basic and the documentation is very thin.
The other way to do this is a 'middleware' application that can use cURL, save the CSVs on a remote server, and then use ZOHO getUrl() to pull those CSV files. This is not an optimal solution, but unless ZOHO gives access to some HTTP client functions I don't see another way.
To get the details of the response headers, include detailed:true in the invokeurl request.
Example:
// parameters is a Map
// header is a Map
response = invokeurl
[
    url :url
    type :POST
    parameters:parameters
    headers:header
    detailed:true
];
// To see all headers and content
info response;
// To see the http response code
info response.get('responseCode');
// With detailed:true any html or json returned will be put in responseText
// info response.get('responseText');
// To see the all http response headers
info response.get('responseHeader');
// To see a specific http response header
// Note: case matters in the response headers name
// "Content-Type" won't find "content-type"
info response.get('responseHeader').get('content-type');
// was the url redirected to another url?
info response.get('responseHeader').get('location');
// get the redirect url
redirect_url = response.get('responseHeader').get('location');
From there you can process the redirect URL and pass it to the next HTTP request.
Recommendation:
After working for months both with detailed:true and without it, I now lean toward always including it. detailed:true returns more useful information in a helpful, regular structure: {responseCode: <code>, responseHeader: <headers>, responseText: <returned-data>}.
This is possible in Deluge using the invokeurl task - https://www.zoho.com/deluge/help/web-data/invokeurl-task.html#response.
invokeurl can hand the response headers over to you, from which you can get the redirect URL and then proceed with the authentication.

Cross-Origin Read Blocking (CORB) issue when making img request

I am currently trying to implement this solution here. The solution seems pretty simple and should be possible since I am the owner of both of the hosts. On mysite1.com I have added the following img tag:
<img src="//www.mysite2.com/cookie_set/" style="display:none;">
On mysite2.com (Django), I have a view like so:
from django.http import HttpResponse

def cookie_set(request):
    response = HttpResponse()
    response.set_cookie('my_cookie', value='awesome')
    return response
When I release this code live, I get the following error:
Cross-Origin Read Blocking (CORB) blocked cross-origin response https://www.mysite2.com/cookie_set/ with MIME type text/html. See https://www.chromestatus.com/feature/121212121221 for more details.
I thought that maybe if I just added "Access-Control-Allow-Origin" in my view this might fix things, but according to the docs here: https://www.chromium.org/Home/chromium-security/corb-for-developers, there's one more consideration:
For example, it will block a cross-origin text/html response requested from a <script> or <img> tag, replacing it with an empty response instead.
Are my assumptions correct? After adding the correct headers should I just change the content-type to something other than text/html?
Ultimately, my final goal is I would like to set a cookie for a different domain that I have control of (ideally without a redirect).
Best solution: use a different tag for this (e.g. an iframe).
The point of CORB is to prevent certain tags from being used for XSSI data injection, so img tag requests should not return text/html, application/json, or XML content types.
So unless the call to the img tag really is just for capturing the request itself (for referrer tracking, for example), you get much more versatility by doing it in an iframe anyway (as in SSO-redirection workflows).
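For example, swapping the hidden img for a hidden iframe (same hypothetical endpoint as above) sidesteps CORB, since a frame is allowed to receive a text/html response:
<iframe src="//www.mysite2.com/cookie_set/" style="display:none;"></iframe>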
See also: Setting third party cookie by using 1x1 <img> tag - Javascript doesn't drop cookie
I fixed this for image files by updating the Content-Type metadata under Properties in S3 - image/jpeg for JPEG files and image/png for PNG files.
My application uploads image files via multer-s3, and it seems it applies Content-Type: 'application/x-www-form-urlencoded'. multer-s3 has a contentType option with a content-type auto-detect feature; using it should prevent improper headers and fix the CORB issue.
It seems the Chrome 76 update includes checking remote file URL headers, specifically Content-Type. CORB was not an issue in other browsers such as Firefox, Safari, and in-app browsers like Instagram's.

Save HTML page after JSON AJAX POST request

I use curl in C++ to download an HTML page from a website, then I save it.
After I've saved the HTML file, I have to read it with another program and save it in a string.
This page contains some requests (POST) made via JSON AJAX. If I open it with the browser I get the right content; if I open it with a text editor I get the wrong content, because the POST requests were never made.
So how can I save the page with the content obtained after the JSON AJAX requests?

curl will download the HTML code of the page and that's it. When you open the HTML file with a web browser, the browser takes care of whatever POST requests are being sent.
You need to find out what the POST request contains (i.e., the data and how it's obtained), send that request separately, and save the response.
You might want to look into this question: How do you make an HTTP request with C++?
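As a rough illustration, here is a libcurl sketch that replays such a POST and saves the response. The endpoint, JSON body, and header below are hypothetical; copy the real ones from the request your browser shows in its developer tools (Network tab):

#include <curl/curl.h>
#include <fstream>
#include <string>

// Append received bytes to a std::string.
static size_t write_cb(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    // Hypothetical endpoint and payload, stand-ins for the real AJAX call.
    const char* json = "{\"id\": 42}";
    struct curl_slist* headers = nullptr;
    headers = curl_slist_append(headers, "Content-Type: application/json");

    std::string response;
    curl_easy_setopt(curl, CURLOPT_URL, "https://www.example.com/api/data");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, json);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
    CURLcode res = curl_easy_perform(curl);

    if (res == CURLE_OK) {
        std::ofstream out("response.json");  // save what the AJAX call returned
        out << response;
    }

    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}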

Does libcurl load a complete page in a single shot?

I'm using libcurl to fire HTTP requests.
Does libcurl load the complete page in a single shot, or does it request the sub-links on the page, i.e. .css or .png files, separately?

libcurl does not automatically send any sub-requests for links found in the requested resource. That would be completely unreasonable behaviour for a transfer library, since it cannot know which of the linked media you actually want.
To retrieve linked media, you have to extract the links from the resource you initially retrieved, and then issue separate requests for them as needed (just like a web browser does behind the scenes).
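A minimal sketch of that two-step flow with libcurl (hypothetical URLs; a real crawler would parse the HTML and request every discovered src/href rather than one hard-coded link):

#include <curl/curl.h>
#include <string>

// Append received bytes to a std::string.
static size_t append_cb(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

// Fetch one URL into `body`; returns true on success.
static bool fetch(const std::string& url, std::string& body) {
    CURL* curl = curl_easy_init();
    if (!curl) return false;
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return res == CURLE_OK;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);

    // Request 1: the page itself. This is all libcurl does on its own.
    std::string html;
    fetch("http://www.example.com/page.html", html);

    // Requests 2..n: each sub-resource found by parsing `html`.
    // (Hard-coded here for illustration only.)
    std::string image;
    fetch("http://www.example.com/image.png", image);

    curl_global_cleanup();
    return 0;
}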