Transfer file from AWS S3 to OneDrive with AWS Lambda - amazon-web-services

A client of ours requested that we have copies of their files on both AWS S3 and OneDrive.
The usual MO: a file is sent from an iOS application to an AWS S3 bucket. This triggers an AWS Lambda function, which attaches the file to an email and sends a copy to the client, who then stores it on OneDrive. Now we want to skip the email step and transfer the file directly to OneDrive.
All my research so far points to Zapier, CloudRail, or the MS Graph REST API. The problem I'm having is that we want to transfer the file with an AWS Lambda function (Java 8), automagically. Almost all the tutorials and examples on MS Graph need a user to log in manually and rely mostly on client-side logic. The other methods have more overhead, and we don't want to make our stack more complicated than it already is.
I realize this is a very specific case. We are systematically replacing the client's file management system, without disrupting their day-to-day operations too much.
Any conclusive pointers/examples/tutorials to get this done server side would be greatly appreciated.

I'm not sure how well S3 aligns with OneDrive; they are quite different models. OneDrive is provisioned per user, which raises the question: which user would you want to copy this file to? I would think Azure Storage would be a far better fit, as it uses a similar model to S3.
You can use Microsoft Graph API to upload the file to a user's OneDrive. You would need to authenticate the user in order to obtain an Access and Refresh Token. Once this process is done, you can store that Refresh Token and retrieve an updated Access Token as needed.
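As a rough illustration of the refresh-token flow described above, here is a minimal Python sketch (the question uses Java 8, but the HTTP calls are the same in any language). It only builds the two HTTP requests: one to the Microsoft identity platform token endpoint to exchange a stored refresh token for a new access token, and one simple-upload PUT to the signed-in user's OneDrive. The function names, the requested scope, and the example path are assumptions for illustration; simple upload only suits small files, and larger ones would need an upload session.

```python
import urllib.parse
import urllib.request

# Microsoft identity platform v2.0 token endpoint and Graph simple-upload URL.
TOKEN_URL = "https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
UPLOAD_URL = "https://graph.microsoft.com/v1.0/me/drive/root:/{path}:/content"


def build_refresh_request(tenant, client_id, client_secret, refresh_token):
    """Build (without sending) the POST that trades a refresh token for an access token."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        # Assumed scope; use whatever the original consent granted.
        "scope": "Files.ReadWrite offline_access",
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL.format(tenant=tenant), data=form, method="POST"
    )


def build_upload_request(access_token, onedrive_path, content):
    """Build (without sending) a simple-upload PUT for a small file."""
    url = UPLOAD_URL.format(path=urllib.parse.quote(onedrive_path))
    return urllib.request.Request(
        url,
        data=content,
        method="PUT",
        headers={
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/octet-stream",
        },
    )
```

Inside the Lambda handler you would send each request with `urllib.request.urlopen(...)`, read `access_token` (and any rotated `refresh_token`) from the JSON response of the first call, and persist the new refresh token for the next invocation.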

Also with CloudRail it's necessary to authenticate the user, but there are methods to store and use an access token.
The services have two methods, loadAsString and saveAsString, which are used to load and store credentials. You could call loadAsString with your access token. The string can differ from service to service, but will look something like this: [{"access_token": "YOUR ACCESS TOKEN"}]

To add to this, Microsoft now has a cloud migration tool www.mover.io that allows you to sync files & folders from most clouds into Azure blob, Sharepoint or OneDrive directly, so without download/upload to a client machine.
Personally used it only for a one-time sync, but leaving it here for posterity.

The client only has to log in once, so if you already have the client and secret keys, you can do the manual flow once and then save the generated token file together with your code files in AWS. The next time the code is run, it uses the refresh token. Last time I did this I was able to set the refresh token to never expire, but I think Microsoft has randomly removed that option and now the token can only last something like 2 or 3 years max.

Related

Make static file accessible to only my app hosted on either AWS or Google Cloud

I have a static site (purely HTML, JS & CSS) I will be hosting on either AWS or Google Cloud.
The site pulls data from a CSV that can either be located as a local file on the site, or preferably, on another endpoint.
My issue is, the client does not want the CSV file to be publicly accessible (people shouldn't be able to directly get to it, and download it).
The file needs to live on AWS S3 or Google Cloud Storage, as the client will be updating it periodically.
However, I can't seem to work out how to make it visible to my app, but not if you try to visit the file directly. I can either make it public, so my app can see it, but then so can everyone else. Or make it not public, so it can't be downloaded, but then my app also can't see it.
My question is - is what I'm trying to achieve even possible? Or does the CSV have to either be public or not?
My ideal option would be two separate buckets, one with my static site on, the other, with the CSV files.
Any suggestions would be most welcome.
What you're asking for isn't really possible in the way you've described it. If the CSV is to be consumed directly by the web page, then the web page needs to be able to get it, and if the web page can get it then so can anyone who can view the page. The same is true of any data that is on a web page. Is there a particular reason why you don't want the CSV to be accessed directly? It wouldn't be something you could just go to, you'd have to know the URI (which would be easy to find, but most users wouldn't bother). Are there things in the data that shouldn't be exposed? If so, you need to rethink your approach entirely.
AWS recently released new S3 features that can help with your need. One is Amazon S3 Access Point (https://aws.amazon.com/s3/features/access-points). The other is called Amazon S3 Object Lambda (https://aws.amazon.com/es/blogs/aws/introducing-amazon-s3-object-lambda-use-your-code-to-process-data-as-it-is-being-retrieved-from-s3/).
It looks like you can put Lambdas in front of an S3 bucket to process and transform requests to its files (also known as "objects"). Unfortunately, I do not have a precise answer to your question as I have never implemented this solution but I believe that you might be able to store the CSV file in S3 and give access to your static site only thanks to an S3 Access Point.
Alternatively, you might be able to use CloudFront and Lambda Associations to also restrict access to your CSV files to specific origins.

What happens if you’re in the middle of a process when AWSAssumeRole times out?

I’m currently working with a role that I need to assume to access certain buckets on S3.
I was wondering: if the duration given to an STSAssumeRoleSessionCredentialsProvider is 1 hour and you’re doing something like downloading a file that takes 1.5 hours, does it finish the process or does it stop in the middle because the duration ended?
The validity of the credentials is verified when the request is initiated. Once initiated successfully, the response will be sent completely. In your download example case, if the credentials were valid when the download request was initiated, that is sufficient for the file to be downloaded completely.
The STS credentials expiry is a problem where repeated connections are made to AWS as part of a long running program and the program reads the credentials at the beginning and stores them. It is generally a good practice to decouple the sts-credential-acquisition process from the users of those credentials and the users should ensure the credentials are always read when the underlying source of credentials (typically a file) is modified.
These aspects are handled by AWS Java SDK's ProfileCredentialsProvider class automatically. Not sure if a similar module exists in other language bindings too.
Credentials are validated when presented on an API call. If you make your API call(s) before the credentials expire then you are fine.
If, however, you need to make multiple API calls, and one of them exceeds the expiration time, then that call will fail.
This is particularly relevant to S3 multi-part uploads, each part of which is a distinct API call, and which presents credentials each time. The solution to this generally is one of:
get credentials that are valid for long enough to complete the operation
refresh credentials when you are close to expiration and use the new credentials for subsequent part uploads
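The second option can be sketched as a small wrapper that hands out cached credentials and fetches new ones shortly before they expire. This is a minimal Python sketch, not an AWS SDK class: `Credentials`, `RefreshingCredentials`, the `fetch` callback (which in practice would call STS AssumeRole), and the five-minute margin are all hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Credentials:
    access_key: str
    secret_key: str
    session_token: str
    expires_at: float  # Unix timestamp at which the credentials stop working


class RefreshingCredentials:
    """Return cached credentials, refetching when they are close to expiry."""

    def __init__(
        self,
        fetch: Callable[[], Credentials],
        margin_seconds: float = 300.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch            # hypothetical callback, e.g. wraps AssumeRole
        self._margin = margin_seconds  # refresh this long before actual expiry
        self._clock = clock            # injectable so the logic can be tested
        self._current: Optional[Credentials] = None

    def get(self) -> Credentials:
        needs_refresh = (
            self._current is None
            or self._clock() >= self._current.expires_at - self._margin
        )
        if needs_refresh:
            self._current = self._fetch()
        return self._current
```

Each part upload would call `get()` just before issuing its request, so a part that starts after the first session has nearly expired presents fresh credentials instead of failing.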

Why does aws s3 getObject executes slowly even with small files?

I am relatively new to amazon web services. There is problem that came up while I was coding my new web app. I am currently storing profile pictures in an s3 bucket.
I don’t want these profile pictures to be seen by the public, only authorized members. So I have a php file like this:
This php file executes getObject and sends out a header to show the picture, but only if the user is allowed to see the picture. I query the database and also check the session to make sure that the currently logged-in user has access to the picture. All is working fine, but it takes around 500 milliseconds for the GET request to execute, even on small files (40 KB). On bigger files it takes even longer, as it does if I embed the php file in an img tag multiple times with different query string values.
I need to mention that I’m testing this in a localhost environment with apache webserver.
Could the problem be that getObject is optimized to be run from an ec2 instance, and that the response time would be much better if I tested this on an ec2?
My s3 is based in London, and I’m testing it in Hungary with a good internet connection so I’m not sure if this response time is what I should get here.
I read that other people had similar issues, but from my understanding the time it takes from s3 to transfer the files to an ec2 should be minimal as they are all in the cloud and the latency between these services and all the other aws services should be minimal (At least if they are in the same region).
Please don’t tell me in comments that I should just make my bucket public and embed the direct link to the file as it is not a viable option for obvious reasons. I also don’t want to generate pre-signed urls for various reasons.
I also tested this without querying the database and essentially the only logic in my code is to get the object and show it to the user. Even with this I get 400+ milliseconds response time.
I also tried using doesObjectExist() and I still need to wait around 300-400 milliseconds for that to give me a response.
Multiple get request to the same php file as image source
UPDATE
I tested it on my ec2 instance and got a much better response time. I tested it with multiple files and all is fine. It seems that if you use getObject on localhost, the time it takes to connect to s3 and fetch the data multiplies.
Thank you for the answers!

How do I transfer images from public database to Google Cloud Bucket without downloading locally

I have a csv file that has over 10,000 urls pointing to images on the internet. I want to perform some machine learning task on them. I am using Google Cloud Platform infrastructure for this task. My first task is to transfer all these images from the urls to a GCP bucket, so that I can access them later via docker containers.
I do not want to download them locally first and then upload them, as that is just too much work; instead I want to transfer them directly to the bucket. I have looked at Storage Transfer Service, and for my specific case I think I will be using a URL list. Can anyone help me figure out how to proceed next? Is this even a possible option?
If yes, how do I generate the MD5 hash that is mentioned here for each url in my list, and also get the number of bytes of the image for each url?
As you noted, Storage Transfer Service requires that you provide it with the MD5 of each file. Fortunately, many HTTP servers may provide you with the MD5 of an object without requiring that you download it. Issuing an HTTP HEAD request may result in the server providing you with a Content-MD5 header in its response, which may not be in the form that Storage Transfer service requires, but it can be converted into that form.
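A minimal Python sketch of that idea follows. It assumes, as my understanding of the URL-list format rather than something stated in this thread, that the list is a tab-separated file starting with the line `TsvHttpData-1.0`, that each row is URL, size in bytes, and base64-encoded MD5, and that rows are sorted by URL. The helper names are hypothetical. A hex digest (as some servers expose it) is converted to base64; a `Content-MD5` header value is normally already base64.

```python
import base64
import binascii
import urllib.request


def head_metadata(url):
    """Issue an HTTP HEAD request; return (Content-Length, Content-MD5), either may be None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Length"), response.headers.get("Content-MD5")


def hex_md5_to_base64(hex_digest):
    """Convert a 32-character hex MD5 digest to its base64 form."""
    return base64.b64encode(binascii.unhexlify(hex_digest)).decode("ascii")


def build_url_list(entries):
    """Render (url, size_in_bytes, md5_base64) tuples as a TSV URL list."""
    lines = ["TsvHttpData-1.0"]
    for url, size, md5_b64 in sorted(entries):
        lines.append(f"{url}\t{size}\t{md5_b64}")
    return "\n".join(lines) + "\n"
```

For any URL where the server returns no usable MD5 or length, there is no way around fetching the object once to compute them.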
The downside here is that web servers are not necessarily going to provide you with that information. There's no way of knowing without checking.
Another option worth considering is to set up one or more GCE instances and run a script from there to download the objects to your GCE instance and from there upload them into GCS. This still involves downloading them "locally," but locally no longer means a place off of Google Cloud, which should speed things up substantially. You can also divide up the work by splitting your CSV file into, say, 10 files with 1000 objects each in them, and setting up 10 GCE instances to do the work.
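Dividing the work is a simple partitioning step. A minimal sketch in Python, with a hypothetical helper name, that splits the list of URLs into a fixed number of near-equal chunks, one per worker instance:

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous lists whose sizes differ by at most one."""
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for index in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + base + (1 if index < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

Each chunk can then be written to its own file and handed to one instance, which downloads its URLs and uploads the results to the bucket.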

Google Drive API to update permissions using File's Patch Endpoint

We are using the Google Drive API to upload files and update permissions in our application. The requirement is to update the permissions for ~60 users/groups.
There are three ways by which we can update permissions on a file :
Use File's Patch Endpoint
Use File's Update Endpoint
Use Permissions's Insert Endpoint
If we go with #3, we have to make ~60 calls per permission change, which is not good: that is a lot of HTTP calls, and it affects quota usage.
So we tried #1, providing the necessary input in the "permissions" key. It returns 200, but the file is not shared as per the given input.
Is there anything that I am missing?
Permissions.Insert is the only way to add permissions to a file; it's not feasible via operations on the Files API.
The Google Drive API does however support batching, which means that instead of sending 60 separate HTTP requests, you can send a single batch that contains 60 requests. This won't help with quota, but will likely perform better. More information here:
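To give a feel for what such a batch looks like on the wire, here is a minimal Python sketch that builds a multipart/mixed body with one permission-creation request per part. It assumes the v3 API from the link above (where the call is permissions.create rather than v2's Permissions.Insert), the batch endpoint `https://www.googleapis.com/batch/drive/v3`, and a hypothetical boundary string and function name; in real code the official client library's batch support would normally build this for you.

```python
import json


def build_batch_body(file_id, permissions, boundary="batch_permissions"):
    """Build a multipart/mixed batch body with one permissions.create call per part."""
    parts = []
    for index, permission in enumerate(permissions, start=1):
        parts.append(
            f"--{boundary}\r\n"
            "Content-Type: application/http\r\n"
            f"Content-ID: <item{index}>\r\n"
            "\r\n"
            # The nested HTTP request for a single permission.
            f"POST /drive/v3/files/{file_id}/permissions\r\n"
            "Content-Type: application/json\r\n"
            "\r\n"
            f"{json.dumps(permission)}\r\n"
        )
    # The closing delimiter ends the multipart body.
    return "".join(parts) + f"--{boundary}--\r\n"
```

The body would be sent in a single authorized POST to the batch endpoint with the header `Content-Type: multipart/mixed; boundary=batch_permissions`. Each permission is a dictionary such as `{"type": "user", "role": "reader", "emailAddress": "someone@example.com"}`, and every part still counts as its own request for quota purposes.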
https://developers.google.com/drive/v3/web/batch