Zip images on web server and return the url - google-cloud-platform

I am currently looking for a way to improve the traffic flow of an app.
Currently the user uploads his data via the app, using Google Cloud Platform as the storage provider. Other users can then download this data again.
This works well so far, but since download traffic at GCP is relatively expensive, I had the idea of outsourcing this to a cheap web server.
The idea is that the user requests the file(s) from GCP. There it is checked whether the file(s) are already on the web server. If not, the file(s) are uploaded to the server.
On the server the files are zipped and the link is sent back to GCP, where it is emailed to the user.
TL;DR: My question is, how can I zip a specific selection of files on a web server without Node.js etc. and send the link of the generated file back to GCP?
I'm open to other ideas as well.

This is a particular case, covered by Google Cloud CDN (Content Delivery Network) service.
As you can read here, there is already a way to connect the CDN to a Storage bucket, and it will do exactly what you were thinking of doing with your own web server. The only difference is that it's already production ready. It handles cache misses, cache hits and so on.
You can compare the prices: here you can find CDN prices, and here you can find Storage prices. The important difference is that Storage is priced per TB of egress, while CDN is priced per 10 TB of egress, and the price is still lower.
Of course, you can still stick to your idea. I would implement it by developing a REST API. The API, with just one endpoint, will serve the file if it is present on the web server. If it is not present, it will:
perform a redirect to the direct link for the file hosted in Storage;
start fetching the file from Storage and put it in the cache.
You would still need to handle the cache: what happens when somebody changes a file? That's something related to the way you're working with those files, so it strictly depends on your app's functional domain; in any case, Cloud CDN would solve it without any further development.
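A minimal sketch of that single-endpoint idea, assuming a Flask app and the google-cloud-storage client; the bucket name, cache directory and helper below are hypothetical, and the redirect assumes the object is publicly readable (otherwise you would redirect to a signed URL instead):

```python
# Sketch of the cache-or-redirect endpoint described above.
# Assumes Flask and google-cloud-storage are installed; names are placeholders.
import os
import threading

from flask import Flask, redirect, send_from_directory
from google.cloud import storage

app = Flask(__name__)
CACHE_DIR = "/var/cache/files"      # hypothetical local cache directory
BUCKET_NAME = "my-app-uploads"      # hypothetical bucket name
storage_client = storage.Client()

def fetch_to_cache(name: str) -> None:
    """Download the object from Storage into the local cache (runs in the background)."""
    blob = storage_client.bucket(BUCKET_NAME).blob(name)
    blob.download_to_filename(os.path.join(CACHE_DIR, name))

@app.route("/files/<name>")
def serve_file(name):
    local_path = os.path.join(CACHE_DIR, name)
    if os.path.exists(local_path):
        # Cache hit: serve straight from the web server.
        return send_from_directory(CACHE_DIR, name)
    # Cache miss: start filling the cache, then redirect to the object in Storage.
    threading.Thread(target=fetch_to_cache, args=(name,), daemon=True).start()
    blob = storage_client.bucket(BUCKET_NAME).blob(name)
    return redirect(blob.public_url)
```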

Related

How to design scalable video streaming architecture using GCP?

I have a video streaming application which streams video from a Google Storage bucket. None of the files in the storage bucket are public. Every time a user clicks on a video from the front-end, I generate a signed URL via an API and load it into the HTML5 video player.
Problem
If the file size is more than 100 MB, it takes around 30-40 seconds to load the video on the front-end.
When I googled to resolve this problem, some articles say to use Cloud CDN with the storage bucket and cache the file. As far as I know, to cache a file it has to be publicly available, and I can't make the files publicly available.
So my concern is: are there any ways to make this scalable and reduce the initial load time?
Cloud CDN will help your latency for sure. Also, with that amount of latency it might be good to look into the actual requests that are being sent to Cloud Storage to make sure chunks are being requested and that the whole video file isn't being loaded before starting to play.
Caching the file does not require that the file is public. You can make the file private and add the Cloud CDN service into your Cloud Storage ACLs (https://cloud.google.com/cdn/docs/using-signed-urls#configuring_permissions). Also, as Kolban noted above, signed cookies might be better for your application to streamline the requests.
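For the signed-URL side of this, a minimal sketch with the google-cloud-storage Python client; the bucket and object names are placeholders, and the expiration window is something you'd tune for your player:

```python
# Sketch: generate a short-lived V4 signed URL for a private video object.
# Requires the google-cloud-storage client and credentials that can sign
# (e.g. a service account key). Bucket/object names are placeholders.
from datetime import timedelta

from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-private-videos").blob("videos/clip-001.mp4")

signed_url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),  # long enough for the player to start streaming
    method="GET",
)
print(signed_url)  # hand this to the HTML5 video player
```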
Not an exact answer, but this site is useful for designing solutions on GCP.
https://gcp.solutions/diagram/media-transcoding
As mentioned earlier, a CDN is the right way to go for video streaming with low latency.

best practice for streaming images in S3 to clients through a server

I am trying to find the best practice for streaming images from S3 to a client's app.
I created a grid-like layout using Flutter on a mobile device (similar to Instagram). How can my client access all of its images?
Here is my current setup: the client opens its profile screen (which contains the grid-like layout of all images, sorted by timestamp). This automatically requests all images from the server. My Python 3 backend server uses boto3 to access S3 and DynamoDB tables. A DynamoDB table has the list of all image paths the client uploaded, sorted by timestamp. Once I get the paths, I use them to download all images to my server first and then send them to the client.
Basically my server is the middleman, downloading and then sending the images back to the client. Is this the right way of doing it? It seems that if the client accessed S3 directly it would be faster, but I'm not sure if that is safe. Plus I don't know how I can give clients access to S3 without giving them AWS credentials...
Any suggestions would be appreciated. Thank you in advance!
What you are doing will work, and it's probably the best option if you are optimising for getting something working quickly, without worrying too much about wasted server resources and unnecessary computation, and if you don't have scalability concerns.
However, if you're worried about scalability, lower latency, and secure access to these image resources, you might want to improve your current architecture.
Once I get the paths, I use that to download all images to my server first and then send it to the client.
This is the first part I would try to get rid of, as you don't really need your backend to download these images and stream them itself. However, it still seems necessary to control access to the resources based on who owns them. I would consider switching to the setup below to improve latency and spend fewer server resources:
Once you get the paths in your backend service, generate presigned URLs for the S3 objects, which will give your client temporary access to these resources (depending on your needs, you can adjust how long a URL remains valid).
Then, send these links to your client so that it can stream directly from S3, rather than your server becoming the middleman for this.
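A minimal boto3 sketch of that flow; the bucket name and keys are placeholders, and the backend keeps the AWS credentials while only handing out time-limited URLs:

```python
# Sketch: turn stored S3 keys into short-lived presigned GET URLs.
# Requires boto3 and AWS credentials on the backend; names are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-user-images"  # placeholder bucket name

def presign(keys, expires_in=900):
    """Return one presigned URL per image key, valid for `expires_in` seconds."""
    return [
        s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": BUCKET, "Key": key},
            ExpiresIn=expires_in,
        )
        for key in keys
    ]

# e.g. keys fetched from DynamoDB, sorted by timestamp
urls = presign(["user123/2021/05/img_001.jpg", "user123/2021/05/img_002.jpg"])
```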
Once you have this setup working, I would consider using Amazon CloudFront to improve access to your objects through the CDN capabilities that CloudFront gives you, especially if your clients are distributed across different geographical regions. As far as I can see, you can also make CloudFront work with presigned URLs.
Is this the right way of doing it? It seems that if the client accesses S3 directly, it'll be faster but I'm not sure if that is safe
Presigned URLs are your way of mitigating uncontrolled access to your S3 objects. You probably need to worry about edge cases though (e.g. how the client should behave when its access to an S3 object has expired, so that users won't notice, etc.). All of these are the costs of making something work at scale, if you have those scalability concerns.

How do I transfer images from public database to Google Cloud Bucket without downloading locally

I have a CSV file that has over 10,000 URLs pointing to images on the internet. I want to perform some machine learning tasks on them. I am using Google Cloud Platform infrastructure for this task. My first task is to transfer all these images from the URLs to a GCP bucket, so that I can access them later via Docker containers.
I do not want to download them locally first and then upload them, as that is just too much work; instead I want to transfer them directly to the bucket. I have looked at the Storage Transfer Service, and for my specific case I think I will be using a URL list. Can anyone help me figure out how to proceed? Is this even a possible option?
If yes, how do I generate the MD5 hash that is mentioned here for each URL in my list, and also get the number of bytes of the image for each URL?
As you noted, Storage Transfer Service requires that you provide it with the MD5 of each file. Fortunately, many HTTP servers can provide you with the MD5 of an object without requiring that you download it: issuing an HTTP HEAD request may result in the server returning a Content-MD5 header in its response. That header may not be in the form that Storage Transfer Service requires, but it can be converted into that form.
The downside here is that web servers are not necessarily going to provide you with that information. There's no way of knowing without checking.
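A rough sketch of that probing step, assuming the requests library; it just collects the raw Content-Length and Content-MD5 headers per URL, and you would still need to massage them into the exact URL-list format Storage Transfer Service documents:

```python
# Sketch: HEAD each URL to collect Content-Length and (if exposed) Content-MD5.
# Not every server returns Content-MD5, so inspect the output before relying on it.
import csv

import requests

with open("urls.csv") as src, open("url-list.tsv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="\t")
    for row in csv.reader(src):
        url = row[0]
        resp = requests.head(url, allow_redirects=True, timeout=10)
        size = resp.headers.get("Content-Length", "")
        md5 = resp.headers.get("Content-MD5", "")  # may be absent entirely
        writer.writerow([url, size, md5])
```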
Another option worth considering is to set up one or more GCE instances and run a script from there to download the objects to your GCE instance and from there upload them into GCS. This still involves downloading them "locally," but locally no longer means a place off of Google Cloud, which should speed things up substantially. You can also divide up the work by splitting your CSV file into, say, 10 files with 1000 objects each in them, and setting up 10 GCE instances to do the work.
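A sketch of that GCE-based approach, streaming each URL straight into a bucket without ever writing to the instance's disk; the bucket name, CSV chunk and object-naming rule are placeholders:

```python
# Sketch: stream each image URL into a GCS bucket from a GCE instance.
# Requires requests and google-cloud-storage; names are placeholders.
import csv

import requests
from google.cloud import storage

bucket = storage.Client().bucket("my-image-bucket")  # placeholder bucket

with open("urls_part_01.csv") as f:          # one of the split CSV chunks
    for row in csv.reader(f):
        url = row[0]
        resp = requests.get(url, stream=True, timeout=30)
        resp.raise_for_status()
        blob = bucket.blob(url.split("/", 3)[-1])  # derive an object name from the URL path
        blob.upload_from_file(resp.raw, content_type=resp.headers.get("Content-Type"))
```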

What are some of the most appropriate ways for serving a large scale django app on Google Compute Engine?

I am working on a project that will presumably have a lot of user-uploaded content and also a fairly large user base. I am now looking to deploy this app on Google Compute Engine.
I have looked at the possible options, and nginx + gunicorn seems to be a good one. In the beginning I am going to be using a single ns-1 instance with a 100 GB persistent disk and Google Cloud SQL for serving my database.
But I want to make things scalable, so that I can add more instances and disk storage without any hassle in the future. I am very confused about how to do that, so the main concern is:
I want a setup that lets me extend my disk space and the number of Compute Engine instances whenever I want.
In order to have a fully scalable architecture, a good approach is to separate computation/serving from file storage, and both from data storage. Going part by part:
file storage - Google Cloud Storage - by storing common service files in a GCS bucket, you get a central repository that is both highly-redundant, and scalable;
data storage - Google Cloud SQL - gives you a highly reliable, scalable MySQL-like database back-end, which can be resized at will to accommodate increasing database usage;
front-ends - GCE instance group - template-generated web / computation front-ends, setting up a resource pool into which a forwarding rule (load balancer) distributes incoming connections.
In a nutshell, this is one of the most adaptable set-ups I can think of, while you keep control over every aspect of the service and underlying infrastructure.
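One way to make the front-ends stateless enough for an instance group to scale them (my own addition, not part of the original answer) is to point Django's file storage at the shared GCS bucket, for example via the django-storages package; the bucket name below is a placeholder:

```python
# settings.py sketch: store user uploads in the shared GCS bucket instead of on
# the instance's local disk, so any front-end in the instance group can serve them.
# Assumes django-storages with its Google Cloud backend is installed.
DEFAULT_FILE_STORAGE = "storages.backends.gcloud.GoogleCloudStorage"
GS_BUCKET_NAME = "my-app-user-uploads"  # placeholder bucket name
```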
A simple approach would be to run a Python app on Google App Engine, which will auto-scale your instances (both up and down) and it supports Django, as mentioned by #spirulence in the comments.
Here are some starting points:
Django and Cloud SQL support on App Engine
Running Pure Django Projects on Google App Engine
Third-party Libraries in Python 2.7
The last link shows which versions of Django are currently supported.

Are there monitoring tools for AWS S3 and CloudFront

I am using the Amazon services S3 and CloudFront for a web application, and I would like to have various statistics about access to the data I am providing, derived from the logs of those services (logging is activated in both).
I did a bit of googling and the only thing I could find is how to manage my S3 storage. I also noticed that New Relic offers monitoring for many Amazon services, but not for those two.
Is there something that you use? A service that could read my logs periodically and provide me with some nice analytics that would make developers and managers happy?
I am trying to avoid writing my own log parsers.
I believe Piwik supports the Amazon S3 log format. Take a look at their demo site to see some example reports.
Well, this may not be what you expect, but I use Qloudstat for my CloudFront distributions.
The $5 plan covers my needs; that's less than a burrito here where I live.
Best regards.
Well, we have a SaaS product, Cloudlytics, which offers you many reports including geo, IP tracking, spam, and CloudFront cost analysis. You can try it for free for up to 25 MB of logs.
I might be answering this very late, but I have worked on a Go library that can analyze CDN and S3 usage and store the results in a backend of your choice (InfluxDB, MongoDB, or Cassandra) for later time-series evaluation. The project is hosted at http://github.com/meson10/cdnlysis
See if this fits.
Popular third-party analytics packages include S3stat, Cloudlytics and Qloudstat. They all run around $10/month for low-traffic sites.
Several stand-alone analytics packages support Amazon's logfile format if you want to download logs each night and feed them in directly. Others might need pre-processing to transform to Combined Logfile Format (CLF) first.
I've written about how to do that here:
https://www.expatsoftware.com/articles/2007/11/roll-your-own-web-stats-for-amazon-s3.html
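If you do go the pre-processing route, here is a rough Python sketch of that transformation; the field positions follow the classic S3 server access log layout as I understand it, and AWS has appended fields over time, so verify the indices against your own log files before trusting the output:

```python
# Sketch: convert S3 server access log lines to Combined Log Format
# (host ident authuser [time] "request" status bytes "referer" "user-agent")
# so a standard web-stats tool can ingest them. Verify field positions locally.
import re
import sys

# Captures bracketed timestamps and double-quoted strings as single fields.
FIELD = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

def s3_to_clf(line: str) -> str:
    f = FIELD.findall(line)
    # Expected order: 0 owner, 1 bucket, 2 [time], 3 remote IP, 4 requester,
    # 5 request id, 6 operation, 7 key, 8 "request", 9 status, 10 error code,
    # 11 bytes sent, 12 object size, 13 total time, 14 turn-around time,
    # 15 "referer", 16 "user-agent", ...
    remote_ip, time, request = f[3], f[2], f[8]
    status, bytes_sent = f[9], f[11]
    referer, user_agent = f[15], f[16]
    return f"{remote_ip} - - {time} {request} {status} {bytes_sent} {referer} {user_agent}"

if __name__ == "__main__":
    for line in sys.stdin:
        if line.strip():
            print(s3_to_clf(line))
```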