In Django I'm looking for a way to serve several different files at once. I can't use pre-built archives (.zip, .tar, etc.) because I don't have enough storage to cache them, and generating them on the fly would take far too long (each could be hundreds of megabytes).
Is there a way I can indicate to the browser that several files are coming its way? Perhaps there is a container format that I can indicate before streaming files to the user?
Edit: There could be hundreds of files in each package, so asking the user to download each one individually is very time consuming.
Ah, the .tar file format can be streamed. I'll experiment with this for now.
http://docs.python.org/library/tarfile.html
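For the record, here is a minimal sketch of that approach (file paths are hypothetical): an uncompressed tar is built on the fly with Python's tarfile in stream mode and handed to Django's StreamingHttpResponse, so the archive is never written to disk.

    # A minimal sketch, assuming hypothetical file paths: build an uncompressed
    # tar archive on the fly and stream it, so nothing is cached on disk.
    import os
    import tarfile

    from django.http import StreamingHttpResponse


    class _ChunkBuffer:
        """Write-only file-like object that collects chunks for the response."""

        def __init__(self):
            self._chunks = []

        def write(self, data):
            self._chunks.append(data)
            return len(data)

        def pop(self):
            data, self._chunks = b"".join(self._chunks), []
            return data


    def _tar_stream(paths):
        buf = _ChunkBuffer()
        # Mode "w|" writes a forward-only tar stream that never seeks back,
        # which is what lets us emit the archive incrementally.
        with tarfile.open(mode="w|", fileobj=buf) as tar:
            for path in paths:
                # Note: this buffers one file at a time in memory; for hard
                # memory bounds, drain the buffer from a pipe or thread instead.
                tar.add(path, arcname=os.path.basename(path))
                yield buf.pop()
        yield buf.pop()  # the trailing tar padding written on close


    def download(request):
        paths = ["/data/big1.bin", "/data/big2.bin"]  # hypothetical paths
        response = StreamingHttpResponse(
            _tar_stream(paths), content_type="application/x-tar"
        )
        response["Content-Disposition"] = 'attachment; filename="files.tar"'
        return response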
Is it possible to concatenate 1,000 CSV files that each have a header into one file with no duplicated header, directly in Google Cloud Storage? I could easily do this by downloading the files onto my local hard drive, but I would prefer to do it natively in Cloud Storage.
They all have the same columns and a header row.
I wrote an article on handling CSV files with BigQuery. To avoid multiple output files, and if the volume is under 1 GB, the recommended approach is the following:
Create a temporary table in BigQuery containing all your CSVs.
Use the Export API (not the export function).
Let me know if you need more guidance.
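A minimal sketch of that flow with the google-cloud-bigquery client library (project, dataset, and bucket names are hypothetical; the CSVs must share one schema):

    # A minimal sketch, assuming hypothetical project/dataset/bucket names:
    # load everything into a temporary table, then export it back to GCS as
    # a single CSV (this only works for results under 1 GB).
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.tmp_merged"  # hypothetical temp table

    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # drop the header row of every input file
        autodetect=True,      # infer the schema from the shared header/rows
    )
    client.load_table_from_uri(
        "gs://my-bucket/input/*.csv", table_id, job_config=load_config
    ).result()

    # A single destination URI (no wildcard) yields a single output file.
    client.extract_table(table_id, "gs://my-bucket/output/merged.csv").result()
    client.delete_table(table_id)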
The problem with most solutions is that you still end up with a large number of split files, where you then have to strip the headers, join them, etc.
Any method of avoiding multiple files also tends to be quite a lot of extra work.
It gets to be quite a hassle, especially when BigQuery spits out 3,500 split gzipped CSV files.
I needed a simple method for this that could be automated from a batch file, so I wrote CSV Merge (sorry, Windows only) to solve exactly this problem.
https://github.com/tcwicks/DataUtilities
Download the latest release, unzip, and use.
I also wrote an article with scenario and usage examples:
https://medium.com/@TCWicks/merge-multiple-csv-flat-files-exported-from-bigquery-redshift-etc-d10aa0a36826
Hope it is of use to someone.
P.S. I recommend tab-delimited over CSV, as it tends to have fewer data issues.
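For readers who'd rather script the merge than use a tool, a minimal Python sketch (file names are hypothetical) that concatenates split gzipped CSV exports while keeping only the first header:

    # A minimal sketch, assuming hypothetical file names: concatenate split
    # gzipped CSV exports, keeping only the first file's header row.
    import glob
    import gzip

    parts = sorted(glob.glob("export-*.csv.gz"))
    with open("merged.csv", "w", encoding="utf-8") as out:
        for i, part in enumerate(parts):
            with gzip.open(part, "rt", encoding="utf-8") as f:
                header = f.readline()
                if i == 0:
                    out.write(header)  # keep the header from the first file only
                out.writelines(f)      # then the remaining data rows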
One of the Google Dataflow utility templates allows us to compress files in GCS (Bulk Compress Cloud Storage Files).
While it is possible to give the input parameter multiple patterns covering different folders (e.g. inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/**.csv), is it actually possible to store the compressed/processed files in the same folder where they were stored initially?
If you have a look at the documentation:
The extensions appended will be one of: .bzip2, .deflate, .gz.
Therefore, the new compressed files won't match the provided pattern (*.csv), and so you can store them in the same folder without conflict.
In addition, this is a batch process. If you look deeper into the Dataflow IO component, especially at reading from GCS with a pattern, you'll see that the file list (the files to compress) is read at the beginning of the job and doesn't evolve during the job.
Therefore, if new files that match the pattern arrive during a job, they won't be taken into account by the current job; you will have to run another job to pick them up.
Finally, one last thing: the existing uncompressed files aren't replaced by the compressed ones, so you end up with each file twice, in a compressed and an uncompressed version. To save space (and money), I recommend deleting one of the two versions.
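As an illustration of that clean-up step, a minimal sketch with the google-cloud-storage client (bucket and prefix names are hypothetical), deleting each uncompressed .csv that now has a compressed twin:

    # A minimal sketch, assuming the google-cloud-storage client library and
    # hypothetical bucket/prefix names. The template appends the extension,
    # so "file.csv" becomes "file.csv.gz".
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("YOUR_BUCKET_NAME")
    names = {blob.name for blob in bucket.list_blobs(prefix="uncompressed/")}
    for name in names:
        if name.endswith(".csv") and name + ".gz" in names:
            bucket.blob(name).delete()  # keep only the compressed copy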
This post suggests limits on the number of files that can be opened in a few OSs, and as per this post, the number of open files in Linux can be changed.
In an app written in C++, I am opening several SQLite database files at a time using sqlite3_open_v2(), one per user's DB; there is no fstream involved.
I was curious to know: what is the open-file limit for SQLite DB files? Is it the same as for normal files, or are they treated differently?
Add-on: suppose the number of files is limited; in that case, what is the correct design for opening and closing such .db files without significantly impacting performance?
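As far as I know, each sqlite3_open_v2() connection holds at least one ordinary file descriptor (plus journal/WAL files while writing), so it counts against the same per-process limit as any other open file. A minimal illustration of inspecting and raising that limit (sketched in Python; the same RLIMIT_NOFILE governs a C++ process on Linux):

    # A minimal illustration: inspect the per-process open-file limit and
    # raise the soft limit up to the hard cap (no special privileges needed).
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open-file limit: soft={soft}, hard={hard}")
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))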
I have been asked by a client to extract text from PDF files stored in zip archives on Dropbox. I want to know how (and whether) it is possible to access these files using PowerShell. (I've read about APIs you can use to access things on Dropbox, but have no idea how this could be integrated into a PowerShell script.) I'd ideally like to avoid downloading them, as there are around 7,000 of them. What I want is a script that reads the content of these files online, in Dropbox, and then processes the relevant data (text) into a spreadsheet.
Just to reiterate: (i) is it possible to access PDF files from Dropbox (and the text in them) which are stored in zip archives, and (ii) how can one go about this using PowerShell - what sort of script/instructions does one need to write?
Note: I am still finding my way around PowerShell, so it is hard for me to elaborate - however, as and when I become more familiar, I will happily update this post.
The only officially supported programmatic interface for Dropbox is the Dropbox API:
https://www.dropbox.com/developers
It does let you access file contents, e.g., using /files (GET):
https://www.dropbox.com/developers/core/docs#files-GET
However, it doesn't offer any ability to interact with the contents of zip files remotely. (Dropbox just considers zip files as blobs of data like any other file.) That being the case, exactly what you want isn't possible, since you can't look inside the zip files without downloading them first. (Likewise, even if the PDF files weren't in zip files, the Dropbox API doesn't currently offer any ability to search the text in the PDF files remotely. You would still need to download them.)
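For what it's worth, the download-then-inspect path is short with the official SDKs. A minimal sketch with the current Python SDK (the question asked about PowerShell, but the idea carries over; the token and path here are hypothetical, and files_download is the modern successor to the /files (GET) endpoint above):

    # A minimal sketch, assuming the official "dropbox" Python SDK and a
    # hypothetical access token and path: the zip has to be downloaded
    # before its contents can be listed or read.
    import io
    import zipfile

    import dropbox

    dbx = dropbox.Dropbox("YOUR_ACCESS_TOKEN")  # hypothetical token
    _, resp = dbx.files_download("/archives/batch1.zip")  # hypothetical path
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        pdfs = [n for n in zf.namelist() if n.lower().endswith(".pdf")]
        print(pdfs)  # the PDF entries; extract and parse them from here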
Collecting the admin media files from different Django applications is one of the not-so-good things about Django. Usually you have to copy files from the module distributions into your media directory every time you install, update, or remove a module.
When you are using several Django applications that have their own media/admin files, some of them even overriding others, you need a proper way of collecting them and building a correct media directory.
A proper solution should be able to recreate the directory by collecting the required files from the modules in a specific order, allowing them to override each other.
Useful links
Proposal: installmedia command - A story for distributing media with apps
What solution do you have for this issue?
I just stack multiple Alias directives one on top of each other, grafting the media to the appropriate place on the tree.
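For illustration, the stacked directives in the Apache config might look like this (paths are hypothetical); mod_alias applies the first matching Alias, so the more specific per-app entries go on top:

    # Hypothetical paths; the first matching Alias wins, so per-app media
    # listed first overrides the fallback admin media below it.
    Alias /media/customapp/ /srv/site/apps/customapp/media/
    Alias /media/ /srv/lib/python/django/contrib/admin/media/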
For the moment, the best solution I have found is to use http://djangosnippets.org/snippets/1068/,
but I'm still looking for better approaches.