I have a webservice that processes and dynamically generates a PDF file from several smaller files. I also have a client application (website) from which users can download the PDF file.
Sometimes the resulting PDF is very large, and because of memory and response-time restrictions the website user is not able to download it.
I am trying to find a way for my website to receive the file in smaller pieces and display it gradually in the user's browser, similar to a video stream but with a PDF file. Is there any way I can accomplish that?
EDIT: I found Mozilla's PDF.js, a client-side PDF viewer built with HTML5, which I think could help with my issue, but I still haven't found in the docs how to solve the problem.
The solution proposed by Dan O would improve the user experience, but not the memory problem on my server (webservice) or the total time to process the PDF.
Use HTTP chunking (see the sketch after these steps). Briefly:
set the Transfer-Encoding: chunked header in your HTTP response
send your PDF file in chunks, each using the following format (the chunk size is a byte count written in hexadecimal):
<chunk size in hexadecimal>
\r\n
<chunk of data>
\r\n
send an empty chunk to tell your user's browser that no more data is coming:
0
\r\n
\r\n
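For example, here is a minimal server-side sketch in Python using Flask; the file name, chunk size, and route are assumptions. Returning a generator without setting a Content-Length makes a WSGI server speaking HTTP/1.1 send the response with Transfer-Encoding: chunked, so the framing above is handled for you:

from flask import Flask, Response

app = Flask(__name__)

def pdf_chunks():
    # Hypothetical source: stream the generated PDF piece by piece
    # instead of loading the whole file into memory first.
    with open('result.pdf', 'rb') as f:
        while True:
            piece = f.read(8192)
            if not piece:
                break
            yield piece

@app.route('/report.pdf')
def report():
    # No Content-Length is set, so an HTTP/1.1 server sends this chunked.
    return Response(pdf_chunks(), mimetype='application/pdf')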
Postman lets you upload files via multipart/form-data or as a binary body. What's the difference between these two?
Binary data
Binary data allows you to send things which you cannot enter in Postman, for example image, audio, or video files. You can send text files as well.
Form-data
multipart/form-data is the default encoding a web form uses to transfer data. This simulates filling in a form on a website and submitting it. The form-data editor lets you set key-value pairs (using the data editor) for your data. It also lets you specify the content type for each part of a multipart form request individually. You can attach files to a key as well.
When you repeatedly make API calls that require sending these files again and again, Postman persists your file paths for subsequent use. This also helps you run collections that contain requests requiring file upload.
Uploading multiple files, each with its own Content-Type, is not supported yet.
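To make the difference concrete, here is a sketch of the two upload styles using Python's requests library; the URL, field names, and file are hypothetical:

import requests

url = 'https://example.com/upload'  # hypothetical endpoint

# multipart/form-data: named key-value fields, with a file attached to a key
with open('photo.png', 'rb') as f:
    requests.post(url, data={'title': 'My photo'},
                  files={'image': ('photo.png', f, 'image/png')})

# binary: the raw file bytes are the entire request body
with open('photo.png', 'rb') as f:
    requests.post(url, data=f, headers={'Content-Type': 'image/png'})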
Good read: Postman Learning Center
I am trying to access a proprietary website which provides access to a large database. The database is quite large (many billions of entries). Each entry in the database is a link to a webpage that is essentially a flat file containing the information that I need.
I have about 2000 entries from the database and their corresponding webpages. I have two related issues that I am trying to resolve:
How to get wget (or any similar program) to read cookie data. I exported my cookies from Google Chrome (using https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg?hl=en), but for some reason the HTML downloaded by wget still cannot be rendered as a webpage. Similarly, I have not been able to get Google Chrome to read cookies from the command line. These cookies are needed to access the database, since they contain my credentials.
In my context it would be OK if the webpage were downloaded as a PDF, but I cannot figure out how to download a webpage as a PDF using wget or similar tools. I tried using automate-save-page-as (https://github.com/abiyani/automate-save-page-as), but I keep getting an error that the browser is not in my PATH.
I solved both of these issues:
Problem 1: I switched away from wget, curl, and Python's requests to simply using the Selenium WebDriver in Python. Using Selenium, I did not have to deal with issues such as passing cookies, headers, POST and GET, since it actually opens a browser. This also has the plus that, as I was writing the script, I could inspect the page and watch what it was doing as it did it.
Problem 2: Selenium has an attribute called page_source, which returns the HTML of the webpage. When I tested it, the HTML rendered correctly.
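A minimal sketch of that approach; the URL and output file name are hypothetical:

from selenium import webdriver

driver = webdriver.Chrome()  # opens a real browser, so cookies and headers just work
driver.get('https://example.com/database/entry/12345')  # hypothetical entry URL
# ...log in or navigate as needed, watching the browser do it...
html = driver.page_source  # the HTML of the currently loaded page
with open('entry.html', 'w', encoding='utf-8') as f:
    f.write(html)
driver.quit()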
I have just started with Python web scraping through requests. This could be a broad question; I will try to make it as brief as possible.
I came across situations where sometimes an entire page source can be downloaded with r.content (where r is the response object of requests' get call).
Sometimes part of the data is stored in JSON format, in files that can be found by closely observing the GET and POST calls the page makes.
However, I have even found websites where the entire content is in the DOM, but part of it is neither in the page source nor in JSON files.
I am wondering: in how many kinds of places can a website store its data?
(Just the names; I am not looking for how to get there.)
For this last type of website I have observed almost every request call made, but couldn't find where the data is.
So are there any other places besides the two mentioned above? Or are those the only two, indicating that I am simply not doing my job of observing the request calls right?
You may answer in brief bullet points; I can take my study from there.
Thanks in advance.
Let's assume we are talking only about HTML data. A web server could serve you data in many other formats (JSON/XML, etc.).
Please note that what I describe here is a generalisation, and like most generalisations, you can find exceptions that do not fit it.
Broadly, we can divide the type of data displayed (to the end user) into two categories:
Pre render
Post render
Pre render
The entire HTML page is constructed server-side and sent across to the client. Here, the JS side is concerned with user interaction, not with the structure of the data.
We are slowly moving away from this type of structure, but currently the large majority of web pages use it.
Web scraping is relatively easy here, as we can programmatically pull the HTML page and not bother with the JavaScript code that accompanies it.
A combination of requests and beautifulsoup should work in almost all cases (assuming you can identify the general structure of the document); see the sketch below.
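A sketch of that pre-render case; the URL and the h2.title selector are hypothetical:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://example.com/articles')  # hypothetical URL
soup = BeautifulSoup(resp.content, 'html.parser')
# The data is already in the served HTML, so plain parsing is enough.
titles = [h2.get_text(strip=True) for h2 in soup.select('h2.title')]
print(titles)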
Post render
Here the HTML page returned from the server is just a "skeleton", with placeholders for the actual data. The data is rendered by the accompanying JS code.
In such cases, if you fetch the source file via, e.g., requests, you will get an empty shell with no data in it.
If you inspect the calls made by the browser while rendering (Chrome's network tab, Firefox's inspect tool, or the popular Firebug), you will most likely see AJAX requests that bring back the actual data from the server.
Depending on how the requests are made, you could hit that AJAX endpoint directly and get the data as JSON.
You could use the response.json() method to extract it into Python dicts.
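A sketch of that post-render case; the endpoint and the 'items' key are hypothetical, found by watching the network tab:

import requests

# Hypothetical AJAX endpoint spotted in the browser's network tab.
resp = requests.get('https://example.com/api/articles?page=1')
data = resp.json()  # parses the JSON body into Python dicts and lists
print(data['items'])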
In certain (rare) cases there will be no AJAX call, but the HTML served will still be a shell: the actual data is part of the served file, but stored inside the JS code itself. This could be done for a variety of reasons, for example to feed dynamic data to static JS files, or just to deter simple attempts at scraping the page.
One approach to scraping such pages is to "render" the page in a headless browser, which executes the JS code and returns HTML that can then be parsed with parsers like beautifulsoup.
beautifulsoup can work with several parsers; html5lib, for example, parses markup the same forgiving way a browser does, which helps with messy pages.
You could also look at selenium or mechanize.
Or you could try parsing the JS code yourself, which might be faster.
Arriving at a conclusion as to what to use requires careful inspection of how the page is rendered in a browser. Even if you don't see an AJAX request, the HTML served by the server need not be what the browser displays.
A good way to start is by looking at the bare HTML being served: download the page via curl or requests.get, or simply render it in your browser with JavaScript disabled.
Good luck.
Here's the deal: I have an Excel sheet that populates a MySQL table. I already made a server-side procedure that receives the sheet, reads it, and puts it into the database. Sadly, the sheet and the database table don't have the same structure, so I need a PHP object/script on the server side to manipulate it. I have an interface to upload the (Excel) file, so the PHP program can read it...
...but my boss's job isn't to make my life easier, is it? No! He says it's a lot of work to have to upload every Excel file through the web interface. So he asked me to put a button in the sheet that he can click once his "job" is done. That would replace the web interface.
But the system itself is an interface that may be sold one day (well, that's the plan!), so I can't simply drop the web interface.
WHAT I'M ASKING IS: Is there a way to send a file (the sheet itself) in a POST request straight from a VBA macro, without using XML files, and name each piece of data I'm sending, like a form post?
So far I've found some tutorials and even some SO posts that got me somewhere, but all of them were about XML, and I already have a method that receives an HTTP POST (from a form) and works. I'm aiming to reuse that same method. From my VBA script I'm already able to make the request (not a big deal) and post it. But the server-side script expects a POST coming from a form, so it refers to each field by name, and I don't seem to be able to do that from a VBA post. =/
Here's the answer... the first two functions/methods define how to send a file to a web service. You only need the file path and the service URL. It answered even more than I expected. :D
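For reference, this is what a form-style file post looks like on the wire. The sketch below builds the multipart/form-data body by hand (shown in Python only to make the format concrete; a VBA macro would assemble the same strings). The field name "sheet", the file name, and the URL are hypothetical; PHP would see this file as $_FILES['sheet'], exactly like a form post:

import uuid
import urllib.request

boundary = uuid.uuid4().hex  # any string not occurring in the file
with open('data.xlsx', 'rb') as f:
    file_bytes = f.read()

# One part per form field; a file part carries a filename and content type.
body = (
    f'--{boundary}\r\n'
    'Content-Disposition: form-data; name="sheet"; filename="data.xlsx"\r\n'
    'Content-Type: application/octet-stream\r\n'
    '\r\n'
).encode() + file_bytes + f'\r\n--{boundary}--\r\n'.encode()

req = urllib.request.Request(
    'https://example.com/upload.php',  # hypothetical server-side script
    data=body,
    headers={'Content-Type': f'multipart/form-data; boundary={boundary}'},
)
urllib.request.urlopen(req)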
I often click on a file link in IE and a download box just pops up. But what happens behind the scenes? I know that IE talks to the web server with the HTTP protocol, and HTTP is text based.
So is an IE download achieved with the HTTP protocol? If so, how can an arbitrary file format be downloaded over a text-based protocol?
I am currently trying to make a web app that will direct my customers to download some file. My current design is to implement a web service: the customer calls it and it returns the file's download URL. But then I don't know what to do with the URL. Could I just use something like File.Copy to copy the file from the URL to the local disk? Or how should I treat the URL? If there's a better design, please teach me.
Many thanks...
By specifying the right content type, you can tell the browser what kind of data you are sending. Note that only HTTP's headers are text: the message body is raw bytes, delimited by the Content-Length header, so binary files travel over HTTP unchanged.
In addition, there are special encodings (like Base64) that can represent binary content as text, using only a limited set of characters and escaping everything else.
Then there is nothing you need to do with the URL. IE will know whether it can or cannot open the file and will show the download box accordingly.
Maybe it's something like this:
<?php
// Tell the browser it is receiving a PDF
header('Content-Type: application/pdf');
// Suggest downloaded.pdf as the file name in the save dialog
header('Content-Disposition: attachment; filename="downloaded.pdf"');
// Let the browser show download progress
header('Content-Length: ' . filesize('original.pdf'));
// The PDF source is in original.pdf
readfile('original.pdf');
?>