Caching documents for distributed processing in AWS

We have a set of microservices for document processing. One service generates a document from a template, another converts it to PDF, another saves it to a database, and so on. The microservices communicate through messages using SNS/SQS. Currently we store the temporary files/data generated by each microservice in S3, since we can't send them as part of the messages. There is a time/cost penalty in reading and writing these documents to S3, and we only want to keep them for a short time. Is S3 the best option for storing this kind of temporary data, or is there a cache mechanism like ElastiCache we could leverage?
The files are Word and PDF documents ranging in size from KBs to MBs.
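For context, this is roughly the hand-off pattern described above: the generating service writes the intermediate document to S3 and passes only the object key in the SQS message, since the document itself would not fit in a message. A minimal sketch assuming boto3; the bucket, queue URL, and key layout are hypothetical, and short retention would be handled by an S3 lifecycle rule rather than in code.

```python
# Sketch of the current pattern: store the intermediate document in S3 and send only
# a pointer over SQS. Bucket, queue URL, and key prefix are hypothetical placeholders.
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "doc-pipeline-temp"  # assumed bucket; a lifecycle rule can expire objects after a day
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pdf-conversion"  # assumed queue

def hand_off_document(doc_id: str, doc_bytes: bytes) -> None:
    key = f"generated/{doc_id}.docx"
    s3.put_object(Bucket=BUCKET, Key=key, Body=doc_bytes)
    # The message carries only the key, staying well under the 256 KB SQS message limit.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"doc_id": doc_id, "s3_key": key}),
    )
```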

Related

Optimal way to use AWS S3 for a backend application

In order to learn how to connect a backend to AWS, I am writing a simple notepad application. On the frontend it uses Editor.js as an alternative to a traditional WYSIWYG editor. I am wondering how best to synchronise the images uploaded by a user.
To upload images from disk, I use the following plugin: https://github.com/editor-js/image
In the configuration of the tool, I give the API endpoint of the server that the image is uploaded to. The server has to respond with the URL of the saved file. My server saves the data to S3 and returns the link.
But what if someone, for example, adds and removes the same file over and over again? Each time, there will be a new request to AWS.
And here is the main part of the question: should I optimize this somehow in practice? I'm thinking of saving the files temporarily on my server first, and only synchronising with AWS from time to time. How is this done in practice? I would be very grateful if you could share any tips or resources that I may have missed.
I am sorry for possible mistakes in my English; I do my best.
Thank you for your help!
I think you should upload them to S3 as soon as they are available. This way you ensure their availability and resilience to failure of your instance. S3 stores files across multiple Availability Zones (AZs), ensuring reliable long-term storage. An instance, on the other hand, operates only within one AZ, and if something happens to it, all the data on the instance is lost. So you could potentially lose an entire batch of images if you wait with the uploads.
In addition to that, S3 has virtually unlimited capacity, so you are not risking any storage shortage. If you keep the images in batches on an instance, then depending on the image sizes there may be a scenario where you simply run out of space.
Finally, a good practice when developing apps on AWS is to make them stateless. This means that your instances should be considered disposable and interchangeable at any time, which is achieved by not storing any user data on the instances. This enables you to auto-scale your application and makes it fault tolerant.
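A minimal sketch of an upload endpoint following this advice, assuming Flask and boto3; the bucket name is a placeholder, the form field name and JSON response shape follow what the editor-js/image plugin expects, and credentials come from the environment or an instance role:

```python
# Sketch: stream each uploaded image straight to S3 and return its URL, so no user
# data lives on the instance. Bucket name is hypothetical; the "image" field and the
# {"success": 1, "file": {"url": ...}} response follow the editor-js/image plugin.
import uuid
import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "notepad-user-uploads"  # hypothetical bucket

@app.route("/upload-image", methods=["POST"])
def upload_image():
    image = request.files["image"]
    key = f"images/{uuid.uuid4()}-{image.filename}"
    s3.upload_fileobj(image, BUCKET, key, ExtraArgs={"ContentType": image.mimetype})
    return jsonify({"success": 1, "file": {"url": f"https://{BUCKET}.s3.amazonaws.com/{key}"}})
```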

What are the downsides of using filesystem for storing uploaded files in django?

I know about s3 storage but I am trying to see if things can work out by only using filesystem
The main reason to use a service like S3 is scalability. Imagine that you use the file system of a single server to store files. That means everyone who visits your site and wants to access a file has to hit that same server, and with enough visitors this will eventually render the system unresponsive.
Scalable storage services store the same data on multiple servers so they can keep serving content as the number of requests increases. Furthermore, requests normally hit a server that is close to the user's location, which minimizes the delay in fetching a file.
Finally, such storage services are more reliable. If you use a single disk to store all the files, it is possible that the disk eventually fails, losing all the data. By storing the data in multiple locations, it is far less likely that the files are completely lost.
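If you later decide to move off the local filesystem, Django can be pointed at S3 with very little code. A minimal sketch assuming the django-storages package and boto3; the bucket name and region are placeholders:

```python
# settings.py -- sketch of switching Django's default file storage to S3, assuming
# django-storages (pip install django-storages boto3). Bucket and region are placeholders.
INSTALLED_APPS = [
    # ... your existing apps ...
    "storages",
]

DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "my-django-uploads"  # hypothetical bucket
AWS_S3_REGION_NAME = "us-east-1"
# Credentials are usually picked up from the environment or an instance role;
# FileField/ImageField uploads then go to S3 instead of MEDIA_ROOT.
```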

Best way to stream or load audio files into S3 bucket (contact centre recordings)

What is the best way to reliably have our clients send audio files to our S3 bucket, which will then feed the processes that work on them (ML processes doing speech-to-text insights)?
The files could be in .wav, .mp3, or other such audio formats. Also, some files may be larger in size.
We would love to get the best ideas (e.g. API Gateway / Lambda / S3?) and to hear from anyone who may have done this before.
Some questions and answers to give context:
How do users interface with your system? We are looking for an API-based approach rather than a browser-based approach. We can get a browser-based approach to work, but we are not sure that is the right technical/architectural/scalable approach.
Do you require a bulk upload method? Yes. We would need bulk upload functionality, and some individual files may be large as well.
Will it be controlled by a human, or do you want it to upload automatically somehow? We certainly want it to upload automatically.
Ultimately, we are building a SaaS solution that will take the audio files and metadata, perform analytics on them, and deliver the results of our analysis through an API back to the app. So we are looking for an approach that will work within this context.
I have a similar scenario.
If you intend to use API Gateway/Lambda/S3, then you should know that there is a limit on the payload size that API Gateway and Lambda can accept. Specifically, API Gateway accepts payloads up to 10 MB and Lambda up to 6 MB.
There is a workaround for this, though: you can upload your files directly to an S3 bucket and attach a Lambda trigger on object creation.
I'll leave some articles that may point you in the right direction:
Uploading a file using presigned URLs: https://docs.aws.amazon.com/AmazonS3/latest/userguide/PresignedUrlUploadObject.html
Lambda trigger on S3 object creation: https://medium.com/analytics-vidhya/trigger-aws-lambda-function-to-store-audio-from-api-in-s3-bucket-b2bc191f23ec
A holistic view of the same issue: https://sookocheff.com/post/api/uploading-large-payloads-through-api-gateway/
Related GitHub issue: https://github.com/serverless/examples/issues/106
So from my point of view, regarding uploading files, the best way would be to return a pre-signed URL and then have the client upload the file directly to S3. Otherwise, you'll have to implement uploading the file in chunks.
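A minimal sketch of generating such a pre-signed upload URL with boto3; the bucket, key scheme, and expiry are placeholders, and the client then PUTs the audio file to the returned URL:

```python
# Sketch: issue a pre-signed PUT URL so the client uploads audio directly to S3,
# bypassing the API Gateway / Lambda payload limits. Bucket, key, and expiry are
# illustrative placeholders.
import boto3

s3 = boto3.client("s3")

def create_upload_url(recording_id: str) -> str:
    return s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={
            "Bucket": "contact-centre-recordings",  # hypothetical bucket
            "Key": f"incoming/{recording_id}.wav",
            "ContentType": "audio/wav",
        },
        ExpiresIn=900,  # URL valid for 15 minutes
    )

# The client then does: PUT <url> with the file bytes and Content-Type: audio/wav.
# An S3 "object created" event on the incoming/ prefix can trigger the processing Lambda.
```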

How do I transfer images from public database to Google Cloud Bucket without downloading locally

I have a CSV file that has over 10,000 URLs pointing to images on the internet. I want to perform some machine learning tasks on them. I am using Google Cloud Platform infrastructure for this task. My first task is to transfer all these images from the URLs to a GCP bucket, so that I can access them later via Docker containers.
I do not want to download them locally first and then upload them, as that is just too much work; I want to transfer them directly to the bucket instead. I have looked at Storage Transfer Service, and for my specific case I think I will be using a URL list. Can anyone help me figure out how to proceed? Is this even a possible option?
If yes, how do I generate the MD5 hash that is mentioned here for each URL in my list, and also get the number of bytes of the image at each URL?
As you noted, Storage Transfer Service requires that you provide it with the MD5 of each file. Fortunately, many HTTP servers can provide you with the MD5 of an object without requiring that you download it: issuing an HTTP HEAD request may get you a Content-MD5 header in the response. That header may not be in the form that Storage Transfer Service requires, but it can be converted into that form.
The downside here is that web servers are not necessarily going to provide you with that information. There's no way of knowing without checking.
Another option worth considering is to set up one or more GCE instances and run a script from there to download the objects to your GCE instance and from there upload them into GCS. This still involves downloading them "locally," but locally no longer means a place off of Google Cloud, which should speed things up substantially. You can also divide up the work by splitting your CSV file into, say, 10 files with 1000 objects each in them, and setting up 10 GCE instances to do the work.
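A minimal sketch of collecting the size and MD5 for each URL with HEAD requests, assuming the requests library and a one-URL-per-row CSV; the TsvHttpData layout reflects my understanding of the URL-list format, so double-check the column order against the Storage Transfer Service docs:

```python
# Sketch: build a Storage Transfer Service URL list by issuing a HEAD request per URL
# to read its size (Content-Length) and MD5 (Content-MD5). Assumes the requests library
# and one URL per CSV row; verify the TSV layout against the official docs.
import base64
import csv
import requests

def head_metadata(url: str):
    resp = requests.head(url, allow_redirects=True, timeout=10)
    resp.raise_for_status()
    size = resp.headers.get("Content-Length")
    md5 = resp.headers.get("Content-MD5")  # usually base64 of the binary digest
    # Some servers return a hex digest instead; convert it to base64 in that case.
    if md5 and len(md5) == 32 and all(c in "0123456789abcdefABCDEF" for c in md5):
        md5 = base64.b64encode(bytes.fromhex(md5)).decode()
    return size, md5

with open("images.csv") as src, open("urllist.tsv", "w", newline="") as dst:
    dst.write("TsvHttpData-1.0\n")
    writer = csv.writer(dst, delimiter="\t")
    for (url,) in csv.reader(src):
        size, md5 = head_metadata(url)
        if size and md5:
            writer.writerow([url, size, md5])
```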

where to store 10kb pieces of text in amazon aws?

These will be indexed and randomly accessed in a web app, like SO questions. SimpleDB has a 1024-byte limit per attribute; you could spread the text over multiple attributes, but that sounds inelegant.
Examples: blog posts; Facebook status messages; recipes (in a blogging application, a Facebook-like application, or a recipe web site).
If I were to build such an application on Amazon AWS, where/how should I store the pieces of text?
With S3, you could put all the actual files in S3, then index them with Amazon RDS, or Postgres on Heroku, or whatever suits you at the time.
Also, you can have the client download the multi-kB text blurbs directly from S3, so your app could just deliver URLs to the messages, thereby creating a massively parallel server: even if the main server is just a single thread on one machine, it only has to construct the page from S3 asset URLs. S3 could store all assets, like images, etc.
The advantages are big. This also solves backup, etc., and allows you to play with many indexing and searching schemes. Search could, for instance, be done using Google...
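A minimal sketch of that delivery pattern, assuming boto3 and a hypothetical bucket holding one object per post body; the URLs are pre-signed here because the bucket is assumed to be private, but a public bucket could return plain object URLs instead:

```python
# Sketch: the app server returns only S3 links for a page's text blurbs; the browser
# fetches the text directly from S3. Bucket and key scheme are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "so-style-posts"  # hypothetical bucket with one object per post body

def page_asset_urls(post_ids):
    return [
        s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": BUCKET, "Key": f"posts/{post_id}.txt"},
            ExpiresIn=3600,
        )
        for post_id in post_ids
    ]
```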
I'd say you would want to look at Amazon RDS, running a relational database like MySQL in the cloud. A single DynamoDB read capacity unit can only (consistently) read a 1 kB item, so that's probably not going to work for you.
Alternatively, you could store the text files in S3 and put pointers to these files in SimpleDB. It depends on a lot of factors which is going to be more cost-effective: how many files you add every day, how often these files are expected to change, how often they are requested, etc.
Personally, I think that using S3 would not be the best approach. If you store all questions and answers in separate text files, you're looking at a large number of requests to display even a simple page, let alone search, which would require you to fetch all the files from S3 and search through them. So for search you need a database anyway.
You could use SimpleDB for keeping an index, but frankly I would just use MySQL on Amazon RDS (there's a free two-month trial period right now, I think), where you can do all the nice things that relational databases can do, and which also offers support for full-text search. RDS should be able to scale up to huge numbers of visitors every day: you can easily scale all the way up to a High-Memory Quadruple Extra Large DB Instance with 68 GB of memory and 26 ECUs.
As far as I know, SO is also built on top of a relational database: https://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/
DynamoDB might be what you want; there is even a forum use case in the documentation: Example Tables and Data in Amazon DynamoDB.
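For a sense of what that looks like in practice, here is a minimal sketch of writing and reading a forum-style item with boto3; the table and attribute names are hypothetical, and the table is assumed to already exist:

```python
# Sketch: storing ~10 kB text blurbs as DynamoDB items, loosely in the spirit of the
# forum example in the docs. Table and attribute names are hypothetical; the table is
# assumed to exist with "ThreadId" as partition key and "PostId" as sort key.
import boto3

table = boto3.resource("dynamodb").Table("ForumPosts")

def save_post(thread_id: str, post_id: str, author: str, body: str) -> None:
    # A 10 kB body fits comfortably within DynamoDB's per-item size limit.
    table.put_item(Item={
        "ThreadId": thread_id,
        "PostId": post_id,
        "Author": author,
        "Body": body,
    })

def get_post(thread_id: str, post_id: str) -> dict:
    return table.get_item(Key={"ThreadId": thread_id, "PostId": post_id}).get("Item", {})
```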
There is insufficient information in the question to provide a reasonable answer to "where should I store text that I'm going to use?"
Depending on how you build your application and what the requirements are for speed, redundancy, latency, volume, scalability, size, cost, robustness, reliability, searchability, modifiability, security, etc., the answer could be any of:
Drop the text in files on an EBS volume attached to an instance.
Drop the text into a MySQL or RDS database.
Drop the text into a distributed file system spread across multiple instances.
Upload the text to S3.
Store the text in SimpleDB.
Store the text in DynamoDB.
Cache the text in ElastiCache.
There are also a number of variations on this, like storing the master copy in S3, caching copies in ElastiCache and on the local disk, indexing it with specific keys in DynamoDB, and making it searchable in CloudSearch.
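As one concrete illustration of that combined variation, here is a minimal cache-aside sketch assuming boto3 and the redis-py client pointed at an ElastiCache (Redis) endpoint; the bucket, key scheme, endpoint, and TTL are all placeholders:

```python
# Cache-aside sketch for the "master copy in S3, cached copy in ElastiCache" variation.
# Assumes boto3 and redis-py; bucket, Redis endpoint, and TTL are placeholders.
import boto3
import redis

s3 = boto3.client("s3")
cache = redis.Redis(host="my-cache.abc123.cache.amazonaws.com", port=6379)  # hypothetical endpoint
BUCKET = "text-master-copies"  # hypothetical bucket

def get_text(doc_id: str) -> str:
    cached = cache.get(f"text:{doc_id}")
    if cached is not None:
        return cached.decode("utf-8")
    # Cache miss: fall back to the master copy in S3, then populate the cache.
    body = s3.get_object(Bucket=BUCKET, Key=f"texts/{doc_id}.txt")["Body"].read().decode("utf-8")
    cache.setex(f"text:{doc_id}", 3600, body)  # keep the hot copy for an hour
    return body
```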