Service on amazon to download temporary files - amazon-web-services

I have to decide one thing, and would be very glad if someone could help me with that.
Here is the situation: we have an infrastructure on Amazon. Back-end processes write multiple files to S3. Then, when a customer requests a report, we launch an EMR job and produce a result file. So the question is: how do we give this report file back to the customer?
What I would like is some temporary storage that provides a unique URL the customer can download the file from.
I was also thinking about storing the result file on S3, but I don't know if that's a good idea.
Is there some kind of service on Amazon that can help me with that?

You can create a signed URL (that you can later shorten if needed) to download the results file from S3: http://s3.amazonaws.com/doc/s3-developer-guide/RESTAuthentication.html
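For illustration, here is a minimal boto3 sketch of generating such a pre-signed URL; the bucket and key names are placeholders, not part of the original answer:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket/key where the EMR job wrote the report.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-reports-bucket", "Key": "reports/customer-123/report.csv"},
    ExpiresIn=3600,  # the link stops working after one hour
)
print(url)  # hand this URL to the customer
```

Anyone holding the URL can download the object until it expires, so the bucket itself can stay private.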

Related

Streaming media to files in AWS S3

My problem:
I want to stream media I record on the client (TypeScript code) to my AWS storage (services like YouTube / Twitch / Zoom / Google Meet can live-record and save the recording to their cloud; some of them are even host-failure tolerant and still produce a file if the host disconnects).
I want each stream to have a different file name so that future triggers can work from it.
I tried to save the stream into S3, but maybe there are storage solutions better suited to my problem.
What services I tried:
S3: I tried to stream directly into S3, but it doesn't really support updating existing objects.
I tried multipart uploads, but they are not host-failure tolerant.
I tried to upload each part as a separate object and have a Lambda merge them (yes, it is very dirty and resource-consuming), but I sometimes ran into ordering problems (see the sketch after this list).
Kinesis Video Streams: I tried to use Kinesis Video Streams but couldn't enable the saving feature through the SDK.
Testing by hand, I saw that it saves a new file after a period of time or once a size threshold is reached, so maybe it is not the solution I want.
Amazon IVS: I tried it because Twitch recommends it, although it is well beyond my requirements.
I couldn't find a code/SDK example of what I want to do (only console walk-throughs).
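For reference, a minimal sketch (using boto3; the bucket, key, and chunk source are placeholders) of a plain S3 multipart upload, where ordering is handled by explicit part numbers rather than a merging Lambda:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-recordings-bucket", "streams/session-42.webm"  # placeholder names


def chunks_of_media(path="recording.webm", size=5 * 1024 * 1024):
    """Placeholder chunk source: reads a local file in 5 MB pieces.
    In the real setup these chunks would come from the recorder."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                break
            yield chunk


upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
upload_id = upload["UploadId"]

parts = []
for number, chunk in enumerate(chunks_of_media(), start=1):
    resp = s3.upload_part(
        Bucket=bucket,
        Key=key,
        UploadId=upload_id,
        PartNumber=number,  # S3 assembles parts strictly by this number
        Body=chunk,         # every part except the last must be >= 5 MB
    )
    parts.append({"PartNumber": number, "ETag": resp["ETag"]})

# Nothing becomes visible in the bucket until this call succeeds.
s3.complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=upload_id,
    MultipartUpload={"Parts": parts},
)
```

The catch for live recording: every part except the last must be at least 5 MB, and nothing is durable until complete_multipart_upload succeeds, so this alone does not give host-failure tolerance.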
Questions
Am I looking at the right services?
What can I do with the AWS SDK to make this work?
Is there a good place with code examples for future problems, or a good way to search for solutions?
Thank you for your help.

Make static file accessible to only my app hosted on either AWS or Google Cloud

I have a static site (purely HTML, JS & CSS) I will be hosting on either AWS or Google Cloud.
The site pulls data from a CSV that can either be located as a local file on the site, or preferably, on another endpoint.
My issue is, the client does not want the CSV file to be publicly accessible (people shouldn't be able to directly get to it, and download it).
The file needs to live on AWS S3 or Google Cloud Storage, as the client will be updating it periodically.
However, I can't work out how to make it visible to my app but not to someone visiting the file's URL directly. I can either make it public, so my app can see it, but then so can everyone else, or make it private, so it can't be downloaded, but then my app can't see it either.
My question is - is what I'm trying to achieve even possible? Or does the CSV have to either be public or not?
My ideal option would be two separate buckets, one with my static site on, the other, with the CSV files.
Any suggestions would be most welcome.
What you're asking for isn't really possible in the way you've described it. If the CSV is to be consumed directly by the web page, then the web page needs to be able to get it, and if the web page can get it then so can anyone who can view the page. The same is true of any data that is on a web page. Is there a particular reason why you don't want the CSV to be accessed directly? It wouldn't be something you could just stumble onto; you'd have to know the URI (which would be easy to find, but most users wouldn't bother). Are there things in the data that shouldn't be exposed? If so, you need to rethink your approach entirely.
AWS recently released new S3 features that can help with your use case. One is Amazon S3 Access Points (https://aws.amazon.com/s3/features/access-points). The other is Amazon S3 Object Lambda (https://aws.amazon.com/es/blogs/aws/introducing-amazon-s3-object-lambda-use-your-code-to-process-data-as-it-is-being-retrieved-from-s3/).
With Object Lambda you can put Lambdas in front of an S3 bucket to process and transform requests for its files (also known as "objects"). Unfortunately, I do not have a precise answer to your question, as I have never implemented this solution, but I believe you might be able to store the CSV file in S3 and restrict access so that only your static site can read it by using an S3 Access Point.
Alternatively, you might be able to use CloudFront with Lambda@Edge (Lambda function associations) to restrict access to your CSV file to specific origins.
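As a rough illustration of that CloudFront route (a sketch only, not a hardened solution; the domain and path below are assumptions), a Lambda@Edge viewer-request handler could reject requests for the CSV that do not carry a Referer from your site:

```python
# Hypothetical Lambda@Edge viewer-request handler (Python runtime).
ALLOWED_REFERER = "https://www.example-static-site.com/"  # assumption: your site's URL
PROTECTED_PATH = "/data/report.csv"                       # assumption: the CSV's path


def handler(event, context):
    request = event["Records"][0]["cf"]["request"]
    if request["uri"] == PROTECTED_PATH:
        headers = request.get("headers", {})
        referer = headers.get("referer", [{}])[0].get("value", "")
        if not referer.startswith(ALLOWED_REFERER):
            # Block direct visits to the file; only page-initiated fetches pass.
            return {
                "status": "403",
                "statusDescription": "Forbidden",
                "body": "Access denied",
            }
    return request  # let CloudFront continue to the S3 origin
```

Since a Referer header can be spoofed, this only deters casual direct downloads, which is in line with the first answer's caveat.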

How do I transfer images from public database to Google Cloud Bucket without downloading locally

I have a CSV file that contains over 10,000 URLs pointing to images on the internet. I want to perform some machine learning tasks on them. I am using Google Cloud Platform infrastructure for this task. My first task is to transfer all these images from the URLs to a GCP bucket, so that I can access them later via Docker containers.
I do not want to download them locally first and then upload them, as that is just too much work; instead I want to transfer them directly to the bucket. I have looked at Storage Transfer Service, and for my specific case I think I will be using a URL list. Can anyone help me figure out how to proceed next? Is this even a viable option?
If yes, how do I generate the MD5 hash that is mentioned here for each URL in my list, and also get the number of bytes of the image for each URL?
As you noted, Storage Transfer Service requires that you provide it with the MD5 of each file. Fortunately, many HTTP servers may provide you with the MD5 of an object without requiring that you download it. Issuing an HTTP HEAD request may result in the server providing you with a Content-MD5 header in its response, which may not be in the form that Storage Transfer service requires, but it can be converted into that form.
The downside here is that web servers are not necessarily going to provide you with that information. There's no way of knowing without checking.
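As a sketch of that idea (assuming the requests package, and that the servers actually return Content-Length plus either Content-MD5 or an MD5-style ETag), this builds the TSV URL list that Storage Transfer Service expects, with one URL, its size in bytes, and its base64-encoded MD5 per line:

```python
import base64
import csv
import string
import sys

import requests  # assumption: available in your environment


def tsv_row(url):
    """HEAD the URL and return (url, size, base64 MD5), or None if the server won't say."""
    resp = requests.head(url, allow_redirects=True, timeout=30)
    size = resp.headers.get("Content-Length")
    md5_b64 = resp.headers.get("Content-MD5")  # already base64-encoded when present
    if md5_b64 is None:
        etag = resp.headers.get("ETag", "").strip('"')
        if len(etag) == 32 and all(c in string.hexdigits for c in etag):
            # Looks like a plain hex MD5 (not guaranteed!); convert hex -> base64.
            md5_b64 = base64.b64encode(bytes.fromhex(etag)).decode()
    if size is None or md5_b64 is None:
        return None
    return (url, size, md5_b64)


def main(path):
    print("TsvHttpData-1.0")  # required first line of a URL list
    writer = csv.writer(sys.stdout, delimiter="\t")
    with open(path) as f:  # assumes one URL in the first column of each row
        for row in csv.reader(f):
            entry = tsv_row(row[0])
            if entry:
                writer.writerow(entry)


if __name__ == "__main__":
    main(sys.argv[1])
```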
Another option worth considering is to set up one or more GCE instances and run a script from there to download the objects to your GCE instance and from there upload them into GCS. This still involves downloading them "locally," but locally no longer means a place off of Google Cloud, which should speed things up substantially. You can also divide up the work by splitting your CSV file into, say, 10 files with 1000 objects each in them, and setting up 10 GCE instances to do the work.
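If you go the GCE route instead, a minimal sketch (assuming the requests and google-cloud-storage packages; the bucket and file names are placeholders) that streams each URL straight into GCS without fully materialising the image on the instance's disk:

```python
import csv

import requests
from google.cloud import storage  # assumption: google-cloud-storage is installed

client = storage.Client()
bucket = client.bucket("my-training-images")  # placeholder bucket name


def copy_url_to_gcs(url, blob_name):
    """Stream an HTTP download directly into a GCS object."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        resp.raw.decode_content = True
        # upload_from_file reads from the raw response stream.
        bucket.blob(blob_name).upload_from_file(resp.raw)


with open("urls.csv") as f:  # assumes one URL per row
    for i, row in enumerate(csv.reader(f)):
        copy_url_to_gcs(row[0], f"images/{i:05d}.jpg")
```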

Transfer file from AWS S3 to OneDrive with AWS Lambda

A client of ours requested that we have copies of their files on both AWS S3 and OneDrive.
The usual MO: a file is sent from an iOS application to an AWS S3 bucket. This triggers an AWS Lambda function which attaches the file to an email and sends a copy to the client, who then stores it on OneDrive. Now we want to skip the email part and transfer the file directly to OneDrive.
All my research so far points to Zapier, CloudRail, or the MS Graph REST API. The problem I'm having is that we want to transfer the file with an AWS Lambda function (Java 8), automagically. Almost all the tutorials and examples for MS Graph need a user to log in manually and rely mostly on client-side logic. The other methods add more overhead, and we don't want to make our stack unnecessarily more complicated than it already is.
I realize this is a very specific case. We are systematically replacing the client's file management system, without disrupting their day-to-day operations too much.
Any conclusive pointers/examples/tutorials to get this done server side would be greatly appreciated.
I'm not sure how well S3 aligns with OneDrive; they are quite different models. OneDrive is provisioned per user, which begs the question: which user would you want to copy this file to? I would think Azure Storage would be a far better fit, as it uses a similar model to S3.
You can use Microsoft Graph API to upload the file to a user's OneDrive. You would need to authenticate the user in order to obtain an Access and Refresh Token. Once this process is done, you can store that Refresh Token and retrieve an updated Access Token as needed.
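The question asks about Java 8, but as a language-agnostic sketch of that flow (all IDs, secrets, and the target user below are placeholders), here is roughly what the Lambda would do in Python: refresh the access token, read the newly uploaded S3 object, and PUT it into the user's OneDrive via Graph:

```python
import boto3
import requests  # assumption: bundled with the Lambda deployment package

TENANT = "your-tenant-id"          # placeholders: fill in your Azure AD app details
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"
REFRESH_TOKEN = "stored-refresh-token"
USER_ID = "user@yourtenant.onmicrosoft.com"


def get_access_token():
    # Exchange the stored refresh token for a fresh access token.
    resp = requests.post(
        f"https://login.microsoftonline.com/{TENANT}/oauth2/v2.0/token",
        data={
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "grant_type": "refresh_token",
            "refresh_token": REFRESH_TOKEN,
            "scope": "https://graph.microsoft.com/.default",  # or the scopes you originally consented to
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def lambda_handler(event, context):
    # Triggered by the S3 upload; copy the new object to OneDrive.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()

    token = get_access_token()
    # Simple upload; files over roughly 4 MB need an upload session instead.
    url = f"https://graph.microsoft.com/v1.0/users/{USER_ID}/drive/root:/{key}:/content"
    resp = requests.put(url, headers={"Authorization": f"Bearer {token}"}, data=body, timeout=60)
    resp.raise_for_status()
```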
With CloudRail it is also necessary to authenticate the user, but there are methods to store and use an access token.
The services have two methods, loadAsString and saveAsString, which are used to store and load credentials. You could call loadAsString with your access token; the string can differ from service to service, but will look something like this: [{"access_token": "YOUR ACCESS TOKEN"}]
To add to this, Microsoft now has a cloud migration tool, www.mover.io, that allows you to sync files and folders from most clouds into Azure Blob Storage, SharePoint, or OneDrive directly, without downloading/uploading via a client machine.
I have personally used it only for a one-time sync, but I'm leaving it here for posterity.
The client only has to log in once, so if you already have the client ID and secret key, you can do the manual flow once and then save the generated token file together with your code files in AWS. The next time the code runs, it uses the refresh token. The last time I did this I was able to set the refresh token to never expire, but I think Microsoft has since removed that option, and now the token can only last something like 2 or 3 years at most.

Log delay in Amazon S3

I have recently started hosting on Amazon S3, and I need the log files to calculate statistics for the "get", "put", and "list" operations on the objects.
I've observed that the log files are organized strangely. I don't know when a log will appear (not immediately; at least 20 minutes after the operation) or how many lines of logs will be contained in one log file.
After that, I need to download these log files and analyse them. But I can't figure out how often I should do this.
Can somebody help? Thanks.
What you describe (log files being made available with delays and in unpredictable order) is exactly the behaviour AWS declares you should expect. This is due to the distributed nature of the system AWS uses to provide the S3 service: the same request may be served each time by a different server; I have seen 5 different IP addresses being used for publishing.
So the only solution is: accept the delay, measure the delay you actually experience, add some extra time, and learn to live with that total delay (I would expect something like 30 to 60 minutes, but statistics could tell you more).
If you need the log records ordered, you either have to sort them yourself or look for a log-processing solution; I have seen applications offered exactly for this purpose.
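If you do sort them yourself, a rough sketch (assuming the standard S3 server access log format, with its bracketed timestamps; the file name is a placeholder) could be:

```python
import re
from datetime import datetime

# S3 server access log lines carry a timestamp like [06/Feb/2019:00:00:38 +0000].
TS = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def log_time(line):
    return datetime.strptime(TS.search(line).group(1), "%d/%b/%Y:%H:%M:%S %z")


with open("access_logs_merged.txt") as f:  # placeholder: your concatenated log files
    ordered = sorted((line for line in f if TS.search(line)), key=log_time)

for line in ordered:
    print(line, end="")
```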
If you really need your log records with a very short delay, you have to produce the logs yourself, which means writing and running some frontend that gives access to your files on S3 and at the same time does the logging you need.
I run such a solution: users get a user name, a password, and the URL of my frontend. When they send a request, I check whether they provided proper credentials and are allowed to see the given resource, and if so, I create a temporary URL for that resource, valid for a few minutes, and redirect the request to it.
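A toy version of such a frontend (a sketch assuming Flask and boto3; the bucket name and hard-coded user store are purely illustrative) checks the caller's credentials and then redirects to a short-lived pre-signed URL:

```python
import boto3
from flask import Flask, abort, redirect, request

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-protected-bucket"   # placeholder bucket name
USERS = {"alice": "s3cret"}      # illustration only; use a real user store


@app.route("/files/<path:key>")
def serve(key):
    auth = request.authorization  # HTTP Basic credentials
    if not auth or USERS.get(auth.username) != auth.password:
        abort(401)
    # Log the access however you need here, then hand out a URL valid for a few minutes.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=300,
    )
    return redirect(url)
```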
But such a frontend costs money (you have to run it somewhere) and is less robust than accessing AWS S3 directly.
Good luck, Lulu.
A lot has changed since the question was originally posted. The delay is still there, but one of the OP's concerns was when to download the logs in order to analyze them.
One option right now would be to leverage Event Notifications: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/setup-event-notification-destination.html
This way, whenever an object is created in the access-logs bucket, you can trigger a notification to SNS, SQS, or Lambda, and based on that download and analyze the new log file.
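For example, a sketch (the operation filter is illustrative only) of a Lambda subscribed to those notifications, which fetches and scans each new log object as soon as it lands:

```python
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # One invocation may carry several new log objects.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for line in body.splitlines():
            # Illustrative: pick out GET/PUT/LIST operations from the access log.
            if any(op in line for op in ("REST.GET.OBJECT", "REST.PUT.OBJECT", "REST.GET.BUCKET")):
                process(line)  # placeholder for your own statistics code


def process(line):
    print(line)
```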