Running java operation directly on AWS S3 target bucket - amazon-web-services

We have the following workflow at my work:
Download the data from AWS s3 bucket to the workspace:
aws s3 cp --only-show-errors s3://bucket1
Unzip the data
unzip -q "/workspace/folder1/data.zip" -d "/workspace/folder2"
Run a java command
java -Xmx1024m -jar param1 etc...
Sync the archive back to the s3 target bucket
aws s3 sync --include #{archive.location} s3://bucket
As you can see, downloading the data from the S3 bucket, unzipping it, running the Java operation on the data, and copying the results back to S3 costs a lot of time and resources.
Hence, we are planning to unzip directly in the S3 target bucket and run the Java operation there. Would it be possible to run the Java operation directly in the S3 bucket? If so, could you please provide some insights?

It's not possible to run the Java 'in S3', but what you can do is move your Java code to an AWS Lambda function, so all the work is done 'in the cloud', i.e., there is no need to download to a local machine, process, and re-upload.
Without knowing the details of your requirements, I would consider setting up an S3 notification that gets triggered each time a new file is PUT into a particular location, an AWS Lambda function that gets invoked with the details of that new file, and then have Lambda write the results to a different bucket/location.
I have done similar things (though not with Java) and have found it a rock-solid way of processing files.
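If you go down that path, the notification itself can also be wired up from the CLI. A rough sketch, assuming a Lambda function named process-upload already exists and already allows S3 to invoke it (the ARN, bucket, and prefix below are placeholders):
# notification.json (placeholder ARN and prefix): invoke the function for every new PUT under uploads/
# {"LambdaFunctionConfigurations": [{"LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-upload",
#   "Events": ["s3:ObjectCreated:Put"],
#   "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "uploads/"}]}}}]}
aws s3api put-bucket-notification-configuration \
    --bucket bucket1 \
    --notification-configuration file://notification.json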

No.
You cannot run code on S3.

S3 is an object store, which doesn't provide any execution environment. To make any modifications to the files, you need to download them, modify them, and upload them back to S3.
If you need to do operations on files, you can look into using AWS Elastic File System (EFS), which you can mount on your EC2 instance and perform the operations as required.
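For example, a rough sketch of mounting EFS on an EC2 instance (the file system ID, region, and mount point are placeholders):
# mount the EFS file system over NFS, then work on the files as if they were local
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs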

Related

How to automatically sync s3 bucket to a local folder using windows server

I'm trying to have a replica of my S3 bucket in a local folder. It should be updated when a change occurs on the bucket.
You can use the aws cli s3 sync command to copy ('synchronize') files from an Amazon S3 bucket to a local drive.
To have it update frequently, you could schedule it as a Windows Scheduled Task. Please note that it will be making frequent calls to AWS, which will incur API charges ($0.005 per 1000 requests).
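For example, a scheduled task along these lines would re-run the sync every 15 minutes (bucket name, local folder, and interval are placeholders):
rem create a recurring task that keeps C:\s3-replica in sync with the bucket
schtasks /Create /SC MINUTE /MO 15 /TN "S3LocalSync" /TR "aws s3 sync s3://my-bucket C:\s3-replica"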
Alternatively, you could use utilities that 'mount' an Amazon S3 bucket as a drive (Tntdrive, Cloudberry, Mountain Duck, etc). I'm not sure how they detect changes -- they possibly create a 'virtual drive' where the data is not actually downloaded until it is accessed.
You can use rclone and WinFsp to mount S3 as a drive, though this might not be a 'mount' in traditional terms.
You will need to set up a task in Task Scheduler for a continuous sync.
Example : https://blog.spikeseed.cloud/mount-s3-as-a-disk/
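A rough sketch of the rclone approach, assuming a remote named s3remote has already been configured with rclone config (the remote name, bucket, and drive letter are placeholders):
# mount the bucket as drive S: through WinFsp; the VFS write cache makes edits behave more like a local disk
rclone mount s3remote:my-bucket S: --vfs-cache-mode writes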

AWS CLI listing all the files within a S3 Bucket

I am practicing AWS commands. My client has given me the AWS IAM access key and secret, but not an account that I can use to log in to the admin console. Those keys are being used with the project itself. What I am trying to do is list all the files recursively within an S3 bucket.
This is what I have done so far.
I have configured the AWS profile for CLI using the following command
aws configure
Then I could list all the available buckets by running the following command
aws s3 ls
Then I am trying to list all the files within a bucket. I tried running the following command.
aws s3 ls s3://my-bucket-name
But it seems like it is not giving me the correct content. I also need a way to navigate around the bucket. How can I do that?
You want to list all of the objects recursively but aren't using the --recursive flag. Without it, the command only shows prefixes and any objects at the root level.
Relevant docs https://docs.aws.amazon.com/cli/latest/reference/s3/ls.html
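For example:
# list every object in the bucket, including those nested under prefixes
aws s3 ls s3://my-bucket-name --recursive --human-readable --summarize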
A few options.
Roll your own
If you run an aws s3 ls and a line item has the word "PRE" instead of a modified date and size, that means it's a "directory" that you can traverse. You can write a quick bash script to run recursive aws s3 ls commands on everything that returns "PRE", indicating it's hiding more files.
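A rough sketch of that idea (the bucket name is a placeholder; keys containing spaces would need extra handling):
#!/bin/bash
# walk the bucket by recursing into every "PRE" entry that aws s3 ls reports
BUCKET="my-bucket-name"
list_prefix() {
    aws s3 ls "s3://$BUCKET/$1" | while read -r line; do
        if [[ $line == PRE* ]]; then
            # second field is the sub-prefix, e.g. "PRE photos/"
            list_prefix "$1$(awk '{print $2}' <<< "$line")"
        else
            # fourth field is the object name relative to the current prefix
            echo "$1$(awk '{print $4}' <<< "$line")"
        fi
    done
}
list_prefix ""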
s3fs
Using the S3FS-Fuse project on GitHub, you can mount an S3 bucket on your file system and explore it that way. I haven't tested this and thus can't personally recommend it, but it seems viable, and it might be a simple way to use tools you already have and understand (like tree).
One concern I would have: when I've used software similar to this, it has made a lot of API calls, and if left mounted long-term it can run up costs purely through the number of API calls.
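A rough sketch of the s3fs approach (bucket name, mount point, and the credentials in the password file are placeholders):
# store credentials in the format ACCESS_KEY_ID:SECRET_ACCESS_KEY, mount the bucket, then browse it
echo "ACCESS_KEY_ID:SECRET_ACCESS_KEY" > ~/.passwd-s3fs && chmod 600 ~/.passwd-s3fs
mkdir -p ~/s3-mount
s3fs my-bucket-name ~/s3-mount -o passwd_file=~/.passwd-s3fs
tree ~/s3-mount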
Sync everything to localhost (not recommended)
Adding this for completeness, but you can run
aws s3 sync s3://mybucket/ ./
This will try to copy everything to your computer, and you'll be able to use your own filesystem. However, S3 buckets can hold petabytes of data, so you may not be able to sync it all to your system. Also, S3 provides a lot of strong security precautions to protect the data, which your personal computer probably doesn't.

Download millions of records from s3 bucket based on modified date

I am trying to download millions of records from an S3 bucket to a NAS. Because there is no particular pattern to the filenames, I can rely only on the modified date to execute multiple CLIs in parallel for a quicker download. I am unable to find any help on downloading files based on modified date. Any inputs would be highly appreciated!
Someone mentioned using s3api, but I am not sure how to use s3api with the cp or sync command to download files.
current command:
aws --endpoint-url http://example.com s3 cp s3://objects/EOB/ \\images\OOSS\EOB --exclude "*" --include "Jun" --recursive
I think this is wrong because --include here refers to the inclusion of 'Jun' within the file name, not the modified date.
The AWS CLI will copy files in parallel.
Simply use aws s3 sync and it will do all the work for you. (I'm not sure why you are providing an --endpoint-url)
Worst case, if something goes wrong, just run the aws s3 sync command again.
It might take a while for the sync command to gather the list of objects, but just let it run.
If you find that there is a lot of network overhead due to so many small files, then you might consider:
Launch an Amazon EC2 instance in the same region (make it fairly big to get large network bandwidth; cost isn't a factor since it won't run for more than a few days)
Do an aws s3 sync to copy the files to the instance
Zip the files (probably better in several groups rather than one large zip)
Download the zip files via scp, or copy them back to S3 and download from there
This way, you are minimizing the chatter and bandwidth going in/out of AWS.
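A rough sketch of those steps (bucket, prefix, key pair, and instance address are all placeholders):
# on the EC2 instance: pull the objects down, then zip them in groups
aws s3 sync s3://my-bucket/ ./data/
zip -r -q batch1.zip ./data/some-prefix/
# on the NAS / local machine: pull each zip down over scp
scp -i my-key.pem ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com:~/batch1.zip .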
I'm assuming you're looking to sync arbitrary date ranges, and not simply maintain a local synced copy of the entire bucket (which you could do with aws s3 sync).
You may have to drive this from an Amazon S3 Inventory. Use the inventory list, and specifically the last modified timestamps on objects, to build a list of objects that you need to process. Then partition those somehow and ship sub-lists off to some distributed/parallel process to get the objects.
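If the bucket can be listed directly, one way to build that list is with s3api and a JMESPath filter on LastModified; a rough sketch (bucket name and date range are placeholders, and the filtering happens client-side after the listing):
# collect keys modified in June 2023, one per line, ready to hand to parallel download workers
aws s3api list-objects-v2 \
    --bucket my-bucket \
    --query "Contents[?LastModified>='2023-06-01' && LastModified<'2023-07-01'].Key" \
    --output text | tr '\t' '\n' > keys-to-download.txt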

Why can I not run dynamic content from Amazon S3?

I know that Amazon S3 is a service for storing static files. But what I don't understand is: if I store some PHP files in an S3 bucket, why isn't it possible to have those files executed from an EC2 instance?
Amazon S3 is a data storage service. When a file is requested from S3, it is sent to the requester, regardless of file format. S3 does not process the file in any way, nor does it pass content to Amazon EC2 for execution.
If you want a PHP file executed by a PHP engine, you will need to run a web server on an Amazon EC2 instance.
Running directly from S3 will never work, as objects stored in S3 aren't presented in a way that your local system can really use.
However, the good news is that you can pull the PHP down from S3 to your local system and execute it!
I use this method myself with an instance created by Lambda to do some file processing. Lambda creates the instance, the bash script in the instance UserData does an S3 copy (see below) to pull down both the PHP file and the data file that PHP will process, and then PHP is called against my file.
To download a file from S3 with the CLI:
# save as file.php in the current directory
aws s3 cp s3://my-s3-bucket-name/my/s3/file.php .
# or save it as a different filename
aws s3 cp s3://my-s3-bucket-name/my/s3/file.php my-file.php
# or save it in a different folder
aws s3 cp s3://my-s3-bucket-name/my/s3/file.php some/directory/path/file.php
You would then pass this file into PHP for execution like any other file.
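Put together, the UserData portion of that setup might look roughly like this (bucket, key names, and the script's arguments are placeholders):
# pull the script and its input down from S3, then run PHP against the data
aws s3 cp s3://my-s3-bucket-name/my/s3/file.php ./file.php
aws s3 cp s3://my-s3-bucket-name/my/s3/input-data.csv ./input-data.csv
php ./file.php ./input-data.csv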

Output AWS CLI "sync" results to a txt file

I'm new to AWS and specifically to the AWS CLI tool, but so far I seem to be going OK.
I'm using the following commands to connect to AWS S3 and synchronise a local directory to my S3 bucket:
set AWS_ACCESS_KEY_ID=AKIAIMYACCESSKEY
set AWS_SECRET_ACCESS_KEY=NLnfMySecretAccessCode
set AWS_DEFAULT_REGION=ap-southeast-2
aws s3 sync C:\somefolder\Data\Dist\ s3://my.bucket/somefolder/Dist/ --delete
This is uploading files OK and displaying the progress and result for each file.
Once the initial upload is done, I'm assuming that all new syncs will just upload new and modified files and folders. Using the --delete will remove anything in the bucket that no longer exists on the local server.
I'd like to be able to output the results of each upload (or download in the case of other servers which will be getting a copy of what is being uploaded) to a .txt file on the local computer so that I can use blat.exe to email the contents to someone who will be monitoring the sync.
All of this will be put into a batch file that will be scheduled to run nightly.
Can the output to .txt be done? If so, how?
I haven't tested this myself, but I found some resources that indicate you can redirect output from command-line driven applications in the Windows command prompt just like you would in Linux.
aws s3 sync C:\somefolder\Data\Dist\ s3://my.bucket/somefolder/Dist/ --delete > output.txt
The resources I found are:
https://stackoverflow.com/a/16713357/4471711
https://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/redirection.mspx?mfr=true
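Building on that, the nightly batch file could capture both standard output and errors and then hand the log to blat, roughly like this (log path, recipient, and mail settings are placeholders):
rem capture stdout and stderr from the sync, then email the log
aws s3 sync C:\somefolder\Data\Dist\ s3://my.bucket/somefolder/Dist/ --delete > C:\logs\s3sync.txt 2>&1
blat C:\logs\s3sync.txt -to ops@example.com -subject "Nightly S3 sync results"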
Once the initial upload is done, I'm assuming that all new syncs will just upload new and modified files and folders. Using the --delete will remove anything in the bucket that no longer exists on the local server.
That is correct, sync will upload either new or modified files as compared to the destination (whether it is an S3 bucket or your local machine).
--delete will remove anything in the destination (not necessarily an S3 bucket) that is not in the source. It should be used carefully to avoid this situation: you download one file, modify it, and sync it back with --delete; because your local machine doesn't have ALL of the files, the --delete flag then deletes every other file at the destination.