Where to store users' media - Django

I have a web service on Google Cloud. It consists of:
a load balancer;
a group of instances (Django code);
a SQL service.
But I don't know where to store users' files. Should I use a separate media server, or Google Cloud Storage?
How do you usually solve such problems?

Well, it will depend on your scenario and your requirements.
If you want full ECM capability for your files, such as versioning, different renditions, security policies, metadata, etc., then ECM vendors like Alfresco or SharePoint (cloud or on-premise) would be a solution.
If you only want to store and retrieve the files, possibly with some metadata, then a file storage service, plain cloud storage like Google Cloud Storage or Dropbox, or an object store like S3/Swift could be the solution.
Object storage fares better when concurrency is higher and you have scalability requirements. If performance is the main concern, file storage will do better.
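If you go the Google Cloud Storage route with Django, the usual pattern is to point Django's default file storage at a bucket so uploaded media never touches the instances' disks. A minimal sketch, assuming the django-storages package with its Google Cloud backend; the bucket name and key path are placeholders:

```python
# settings.py -- sketch only; "my-media-bucket" and the key path are placeholders.
from google.oauth2 import service_account

DEFAULT_FILE_STORAGE = "storages.backends.gcloud.GoogleCloudStorage"
GS_BUCKET_NAME = "my-media-bucket"
GS_CREDENTIALS = service_account.Credentials.from_service_account_file(
    "/path/to/service-account.json"
)
```

With that in place, every FileField/ImageField upload is written to the bucket, so any instance behind the load balancer can accept or serve media without a shared disk.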


Zip images on web server and return the url

I am currently looking for a way to improve the traffic flow of an app.
Currently the user uploads his data via the app, using Google Cloud Platform as the storage provider. Other users can then download this data again.
This works well so far, but since download traffic at GCP is relatively expensive, I had the idea of outsourcing this to a cheap web server.
The idea is that the user requests the file(s) at GCP. There it is checked whether the file(s) are already on the web server. If not, the file(s) are uploaded to the server.
At the server the files are zipped and the link is sent back to GCP, where it is emailed to the user.
TL;DR My question is: how can I zip a specific selection of files on a web server without Node.js etc. and send the link of the generated file back to GCP?
I'm open to other ideas as well.
This is a particular case, covered by Google Cloud CDN (Content Delivery Network) service.
As you can read here, there already is a way to connect the CDN to a Storage bucket, and it will do exactly what you've thought to do with your own web server. The only difference is that it's already production ready. It handles cache misses, cache hits and so on.
You can compare the prices: here you can find CDN prices, and here you can find Storage prices. The important difference is that Storage egress is priced per TB, while CDN egress is priced per 10 TB, and the price is still lower.
Of course, you can still stick to your idea. I would implement it by developing a REST API. The API, with just one endpoint, would serve the file if it is present on the web server. If it is not present, it would:
perform a redirect to the direct link for the file hosted in Storage;
start to fetch the file from Storage and put it in the cache.
You would still need to handle the cache: what happens when somebody changes a file? That depends on the way you're working with those files, so it is strictly tied to your app's functional domain; in any case, Cloud CDN would solve it without any further development.
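If you do stick with your own cache server instead of Cloud CDN, a rough sketch of that single endpoint could look like the following. Flask, the cache directory layout and the public bucket URL are assumptions made purely for illustration: the endpoint zips the requested selection of files when they are already cached, and otherwise redirects to Storage while warming the cache in the background.

```python
# Hypothetical cache-and-zip endpoint; Flask, CACHE_DIR and the public GCS URL
# pattern are illustrative assumptions, not part of the original setup.
import io
import os
import threading
import zipfile

import requests
from flask import Flask, abort, redirect, request, send_file

app = Flask(__name__)
CACHE_DIR = "/var/cache/media"
GCS_PUBLIC_URL = "https://storage.googleapis.com/my-bucket/{name}"

def fetch_to_cache(name):
    """Background job: pull one object from Storage into the local cache."""
    resp = requests.get(GCS_PUBLIC_URL.format(name=name), timeout=60)
    if resp.ok:
        path = os.path.join(CACHE_DIR, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(resp.content)

@app.route("/download")
def download():
    # e.g. /download?file=a.jpg&file=b.jpg  (real code should sanitise names)
    names = request.args.getlist("file")
    if not names:
        abort(400)
    missing = [n for n in names if not os.path.exists(os.path.join(CACHE_DIR, n))]
    if missing:
        # Warm the cache for next time, then fall back to Storage directly.
        for n in missing:
            threading.Thread(target=fetch_to_cache, args=(n,), daemon=True).start()
        return redirect(GCS_PUBLIC_URL.format(name=names[0]))
    # Everything is cached: zip the requested selection and return it.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for n in names:
            zf.write(os.path.join(CACHE_DIR, n), arcname=n)
    buf.seek(0)
    return send_file(buf, mimetype="application/zip",
                     as_attachment=True, download_name="selection.zip")
```

Note the sketch does no cache invalidation; as said above, that part depends on your app's functional domain, and Cloud CDN handles it for you.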

How do I transfer images from public database to Google Cloud Bucket without downloading locally

I have a CSV file that has over 10,000 URLs pointing to images on the internet. I want to perform some machine learning tasks on them. I am using Google Cloud Platform infrastructure for this task. My first task is to transfer all these images from the URLs to a GCP bucket, so that I can access them later via Docker containers.
I do not want to download them locally first and then upload them, as that is just too much work; I'd rather transfer them directly to the bucket. I have looked at Storage Transfer Service, and for my specific case I think I will be using a URL list. Can anyone help me figure out how to proceed next? Is this even a possible option?
If yes, how do I generate the MD5 hash that is mentioned here for each URL in my list, and also get the number of bytes of the image at each URL?
As you noted, Storage Transfer Service requires that you provide it with the MD5 of each file. Fortunately, many HTTP servers will provide you with the MD5 of an object without requiring that you download it. Issuing an HTTP HEAD request may result in the server providing a Content-MD5 header in its response, which may not be in the form that Storage Transfer Service requires, but can be converted into that form.
The downside here is that web servers are not necessarily going to provide you with that information. There's no way of knowing without checking.
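A hedged sketch of that probing step: send a HEAD request per URL, read Content-Length and Content-MD5 where the server reports them, and write out the tab-separated URL list (TsvHttpData-1.0) that Storage Transfer Service consumes. The input CSV column name "url" is an assumption; adjust it to your file.

```python
# Sketch: probe each URL with a HEAD request and emit a Storage Transfer
# Service URL list. The "url" column name in the input CSV is an assumption.
import base64
import csv

import requests

def to_base64_md5(content_md5):
    """Content-MD5 is usually already Base64; convert hex digests if needed."""
    if content_md5 and len(content_md5) == 32 and all(
        c in "0123456789abcdefABCDEF" for c in content_md5
    ):
        return base64.b64encode(bytes.fromhex(content_md5)).decode("ascii")
    return content_md5

with open("images.csv") as src, open("url_list.tsv", "w") as dst:
    dst.write("TsvHttpData-1.0\n")
    for row in csv.DictReader(src):
        url = row["url"]
        head = requests.head(url, allow_redirects=True, timeout=30)
        size = head.headers.get("Content-Length")
        md5 = to_base64_md5(head.headers.get("Content-MD5"))
        if size and md5:
            dst.write(f"{url}\t{size}\t{md5}\n")
        else:
            print(f"skipping {url}: server did not report size and MD5")
```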
Another option worth considering is to set up one or more GCE instances and run a script from there to download the objects to your GCE instance and from there upload them into GCS. This still involves downloading them "locally," but "locally" no longer means somewhere off Google Cloud, which should speed things up substantially. You can also divide up the work by splitting your CSV file into, say, 10 files with 1,000 objects each, and setting up 10 GCE instances to do the work.
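A minimal sketch of such a worker script, assuming the google-cloud-storage client library, a placeholder bucket name and a CSV shard with a url column; each image is fetched into memory and written straight to the bucket:

```python
# Sketch of the "download on a GCE instance, upload to GCS" approach.
# Bucket name, CSV layout and object naming scheme are assumptions.
import csv
import os

import requests
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-image-bucket")

with open("images_part_01.csv") as src:
    for row in csv.DictReader(src):
        url = row["url"]
        resp = requests.get(url, timeout=60)
        if not resp.ok:
            print(f"failed: {url} ({resp.status_code})")
            continue
        # Use the last path segment as the object name; adjust as needed.
        name = os.path.basename(url.split("?", 1)[0])
        blob = bucket.blob(f"images/{name}")
        blob.upload_from_string(
            resp.content,
            content_type=resp.headers.get("Content-Type", "application/octet-stream"),
        )
```

Run one copy per CSV shard on each instance to spread the load across the 10 workers.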

Google Apps - Data Transfer API - transfer only some resources

I'm trying to use the new Data Transfer API for Google Apps Domain, and I would like to transfer some specific Google Drive files from one user to another. It seems we can use this API to transfer a "full service" (e.g. all files from Google Drive) and not only some specific files.
Is my understanding of this API correct, or is it possible to limit the transfer to specific resources?
Thank you.
You're correct. The API enables you to transfer ownership of application data (currently Drive documents and Google+ pages) in bulk. It essentially allows you to automate the manual ownership transfer task documented here. You might want to read this blog post, which has some useful background information.
The only way to achieve what you want is to use the Drive API (not to be confused with the Drive SDK).
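Concretely, that means calling the Drive API's permissions.create with transferOwnership=True for each file you want to move. A sketch with the Python client library; the file ID, e-mail addresses and key path are placeholders, and domain-wide delegation is assumed so the service account can act as the current owner:

```python
# Sketch: transfer ownership of one Drive file via the Drive v3 API.
# FILE_ID, the e-mail addresses and the key path are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "/path/to/service-account.json",
    scopes=["https://www.googleapis.com/auth/drive"],
).with_subject("current.owner@example.com")   # impersonate the current owner

drive = build("drive", "v3", credentials=creds)

drive.permissions().create(
    fileId="FILE_ID",
    transferOwnership=True,                    # required when granting role=owner
    body={"type": "user", "role": "owner", "emailAddress": "new.owner@example.com"},
).execute()
```

Loop that over the specific file IDs you care about and you get the selective transfer that the Data Transfer API does not offer.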

What are some of the most appropriate ways for serving a large scale django app on Google Compute Engine?

I am working on a project that will presumably have a lot of user-uploaded content and also a fairly large user base. I am now looking at deploying this app to Google Compute Engine.
I have looked up the possible options, and nginx + gunicorn seems to be a good one. In the beginning I am going to be using a single ns-1 instance with a 100 GB persistent disk and Google Cloud SQL for serving my database.
But I want to make things scalable so that I can add more instances and disk storage without any hassle in the future. I am very confused about how to do that, so my main concern is:
I want a setup such that I can extend my disk space and the number of Google Compute Engine instances whenever I want.
In order to have a fully scalable architecture, a good approach is to separate computation/serving from file storage, and both from data storage. Going part by part:
file storage - Google Cloud Storage - by storing common service files in a GCS bucket, you get a central repository that is both highly-redundant, and scalable;
data storage - Google Cloud SQL - gives you a highly reliable, scalable MySQL-like database back-end, which can be resized at will to accommodate increasing database usage;
front-ends - GCE instance group - template-generated web / computation front-ends, setting up a resource pool into which a forwarding rule (load balancer) distributes incoming connections.
In a nutshell, this is one of the most adaptable set-ups I can think of, while you keep control over every aspect of the service and underlying infrastructure.
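As a concrete, hedged illustration of that split on the Django side, the front-end instances stay stateless: media writes go to the GCS bucket (via django-storages, as in the first thread above) and the database connection points at Cloud SQL. The bucket name, host and credentials below are placeholders.

```python
# settings.py fragment -- sketch of the stateless front-end configuration.
import os

# Media files live in Cloud Storage, not on the instance's persistent disk.
DEFAULT_FILE_STORAGE = "storages.backends.gcloud.GoogleCloudStorage"
GS_BUCKET_NAME = "my-app-media"

# Data lives in Cloud SQL (MySQL-compatible), reachable from every front-end.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "appdb",
        "USER": "app",
        "PASSWORD": os.environ["DB_PASSWORD"],
        "HOST": "10.0.0.3",   # Cloud SQL IP, or a Cloud SQL proxy endpoint
        "PORT": "3306",
    }
}
```

Because no state lives on the instances themselves, the managed instance group can add or remove front-ends behind the forwarding rule at will.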
A simple approach would be to run a Python app on Google App Engine, which will auto-scale your instances (both up and down) and it supports Django, as mentioned by #spirulence in the comments.
Here are some starting points:
Django and Cloud SQL support on App Engine
Running Pure Django Projects on Google App Engine
Third-party Libraries in Python 2.7
The last link shows which versions of Django are currently supported.

In WebJobs SDK, How to bind additional CloudStorageAccount for Blob output?

The WebJobs SDK does a wonderful job of simplifying the amount of code one needs to write to save blobs to storage, but all within ONE storage account, the default AzureJobsStorage.
Having everything (queues, blobs, tables, and heartbeats) in one storage account will throttle that account in a medium-load production environment.
Of course, I can write legacy WindowsAzure.Storage code to save blobs to the desired storage account, but I would lose the simplicity of the WebJobs SDK.
Appreciate any suggestions or advice.
Today, the WebJobs SDK supports only two Storage accounts per host:
AzureWebJobsStorage - used for your app's data
AzureWebJobsDashboard - used for logging (heartbeats, functions, etc) and dashboard indexing
The two accounts can be different if you want but that's all the separation you can do for now.
We have an item on the backlog to support multiple storage accounts for data but there is no ETA for it.
This is somewhat of a hack around the limitation, but suppose you want specific jobs associated with specific storage accounts (instead of one job accessing and writing to different storage accounts). You could open two different job hosts with different configs, and create your own TypeLocator to filter which jobs are associated with which host.