Is it possible to run Trino on top of plain AWS S3 without any additional engine? There is no S3 connector in the Trino connector list, but the docs mention that it can run over S3 or, for example, Hive. So do I need some layer over S3 such as Hadoop/Hive, or is it possible to use Trino with S3 as-is?
Trino can use S3 as a storage mechanism through the Hive connector. But S3 itself is only object storage (basically files); there is no server-type component. You must have the Trino server process running somewhere, either as a Linux process or as a Docker container.
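For illustration, a minimal Hive catalog configuration pointing at S3 might look roughly like the sketch below. Treat it as an assumption rather than a recipe: the metastore URI, keys, and exact property names depend on your Trino version, and you also need a Hive metastore (or AWS Glue) to hold the table metadata.

etc/catalog/hive.properties:
connector.name=hive
hive.metastore.uri=thrift://example-metastore:9083
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY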
I'm trying to keep a replica of my S3 bucket in a local folder. It should be updated when a change occurs in the bucket.
You can use the AWS CLI's aws s3 sync command to copy ('synchronize') files from an Amazon S3 bucket to a local drive.
To have it update frequently, you could schedule it as a Windows Scheduled Task. Please note that it will be making frequent calls to AWS, which will incur API charges ($0.005 per 1000 requests).
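For example, a sketch of what that could look like from a Windows command prompt (the bucket name, local path, and interval are placeholders; add --delete if you also want local files removed when they disappear from the bucket):

aws s3 sync s3://my-bucket C:\s3-replica
schtasks /create /tn "S3Sync" /tr "aws s3 sync s3://my-bucket C:\s3-replica" /sc minute /mo 5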
Alternatively, you could use utilities that 'mount' an Amazon S3 bucket as a drive (TntDrive, CloudBerry, Mountain Duck, etc). I'm not sure how they detect changes -- they possibly create a 'virtual drive' where the data is not actually downloaded until it is accessed.
You can use rclone and WinFsp to mount S3 as a drive.
Though this might not be a 'mount' in the traditional sense.
You will need to set up a scheduled task for a continuous sync.
Example: https://blog.spikeseed.cloud/mount-s3-as-a-disk/
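A rough sketch of both approaches, assuming an rclone remote named s3remote has already been created with rclone config (the remote name, bucket, drive letter, and local path are placeholders):

rclone mount s3remote:my-bucket X: --vfs-cache-mode full
rclone sync s3remote:my-bucket C:\s3-replica

The mount gives you a drive backed by the bucket via WinFsp; the sync is what you would put in the scheduled task.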
I have a requirement to move files from on-prem NAS storage to AWS S3.
Files keep arriving on the NAS storage; we have a notification set up in AWS for when a file arrives, and then we need to move the file from the NAS to S3.
Can I access the NAS storage and pull files from it into S3?
Does it require any additional configuration, or can a simple EC2 instance or Lambda function work, depending on the size of the file?
How about NAS --> SFTP --> S3 using the AWS Transfer Family?
Is there a better way to move files from the NAS to S3?
We want to avoid writing code as much as we can.
You should take a look at AWS DataSync.
It is an AWS data transfer service that lets you copy data to and from AWS storage services over the Internet or over AWS Direct Connect (it supports the NFS and SMB protocols).
You don't need EC2 or AWS Lambda. You have to install an agent that reads from the source location and syncs your data to S3. The agent is deployed on-premises. You can find the supported hypervisors here: https://docs.aws.amazon.com/datasync/latest/userguide/agent-requirements.html and the deployment guide here: https://docs.aws.amazon.com/datasync/latest/userguide/deploy-agents.html
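If you prefer scripting it, the locations and the task can also be created with the AWS CLI once the agent is activated. A sketch, where every hostname, path, and ARN is a placeholder:

$ aws datasync create-location-nfs --server-hostname nas.example.local --subdirectory /export/data --on-prem-config AgentArns=arn:aws:datasync:us-east-1:123456789012:agent/agent-0example
$ aws datasync create-location-s3 --s3-bucket-arn arn:aws:s3:::my-bucket --s3-config BucketAccessRoleArn=arn:aws:iam::123456789012:role/DataSyncS3Role
$ aws datasync create-task --source-location-arn <nfs-location-arn> --destination-location-arn <s3-location-arn>
$ aws datasync start-task-execution --task-arn <task-arn>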
Can we use Amazon's Signature Version 4 POST API (https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-post-example.html)? I don't think it's useful for my purpose.
What I want is this: there are some files on third-party websites, and I want those files uploaded to an Amazon S3 bucket without downloading them to my local computer first. The current scenario is:
The third-party website provides a download link for the file -> download the file to my computer -> upload it to Amazon S3
Can we eliminate the middle step so it becomes:
The third-party website provides a download link for the file -> upload to Amazon S3
You can't avoid the 'download' part unless that other website is willing to do the upload for you.
But you can take your local network connection out of the equation and do the download/upload on an EC2 instance in the same region as your bucket:
$ wget https://example.com/example.txt
$ aws s3 cp example.txt s3://mybucket
Your EC2 instance should have an IAM role that allows it to interact with S3.
You can do the same thing with Lambda, but you'll be limited by the size of the Lambda runtime's filesystem (512 MB of /tmp by default).
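If the file is large, you can also stream it straight through without writing it to disk, since aws s3 cp can read from stdin. A sketch with a placeholder URL and bucket:

$ curl -s https://example.com/example.txt | aws s3 cp - s3://mybucket/example.txt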
The third-party website provides downloaded link of file -> upload to amazon s3
If the third party doesn't push the content itself, you will need an actor/service/piece of logic that downloads and uploads the data.
The "logic" means some compute resource - EC2, ECS, Lambda, Batch... It's the same download/upload process, just the traffic doesn't need to go through your computer. Every option has its pros and cons (e.g. Lambda may be the cheapest for occasional tasks, but it has its limits).
You did not specify what initiates the upload (a regular scan? an event? on demand?); that may affect your options too.
I'm setting up distributed MinIO servers locally to use in a solution, but I would like to back them up to S3 regularly in case the local file system fails, for extra durability, or simply to migrate to AWS. The use case is that we need S3-compatible storage locally for regular access, but would like the option of having it backed up in the cloud.
I wanted to check if anyone has tried something similar before, or knows of a simple way/tool to keep MinIO buckets in sync with S3 buckets?
You can simply use mc mirror to copy from your MinIO alias to your S3 alias.
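A rough sketch (the aliases, endpoints, credentials, and bucket names are placeholders):

$ mc alias set minio http://localhost:9000 MINIO_ACCESS_KEY MINIO_SECRET_KEY
$ mc alias set s3 https://s3.amazonaws.com AWS_ACCESS_KEY AWS_SECRET_KEY
$ mc mirror --watch minio/my-bucket s3/my-backup-bucket

The --watch flag keeps the process running and mirrors changes as they happen instead of doing a one-off copy.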
Have you checked out Rclone (https://rclone.org/)? Rclone is a command line program to sync files and directories to and from MinIO, S3, and lots more.
Combine this with either a batch or cron job, or something based on MinIO notifications triggered on update/delete.
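For example, with remotes named minio and s3 configured via rclone config (the remote and bucket names are placeholders), the job would boil down to something like:

$ rclone sync minio:my-bucket s3:my-backup-bucket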
We have the following workflow at my work:
1. Download the data from the AWS s3 bucket to the workspace:
aws s3 cp --only-show-errors s3://bucket1
2. Unzip the data:
unzip -q "/workspace/folder1/data.zip" -d "/workspace/folder2"
3. Run a java command:
java -Xmx1024m -jar param1 etc...
4. Sync the archive back to the s3 target bucket:
aws s3 sync --include #{archive.location} s3://bucket
As you can see, downloading the data from the s3 bucket, unzipping it, running the java operation on it, and copying it back to s3 costs a lot of time and resources.
Hence, we are planning to unzip directly in the s3 target bucket and run the java operation there. Would it be possible to run the java operation directly in the s3 bucket? If yes, could you please provide some insights?
It's not possible to run the java 'in S3', but what you can do is move your Java code into an AWS Lambda function so that all the work is done 'in the cloud', i.e., there is no need to download to a local machine, process, and re-upload.
Without knowing the details of your requirements, I would consider setting up an S3 event notification that fires each time a new file is PUT into a particular location, have it invoke an AWS Lambda function with the details of that new file, and then have the Lambda function write the results to a different bucket/location.
I have done similar things (though not with java) and have found it to be a rock-solid way of processing files.
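For reference, the notification itself can be wired up from the CLI. A sketch, where the bucket name and function ARN are placeholders (the function also needs a resource-based permission, via aws lambda add-permission, so that S3 is allowed to invoke it):

$ aws s3api put-bucket-notification-configuration --bucket my-source-bucket \
    --notification-configuration '{"LambdaFunctionConfigurations":[{"LambdaFunctionArn":"arn:aws:lambda:us-east-1:123456789012:function:process-file","Events":["s3:ObjectCreated:*"]}]}'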
No.
You cannot run code on S3.
S3 is an object store; it doesn't provide any execution environment. To make any modifications to the files, you need to download them, modify them, and upload them back to S3.
If you need to do operations on files, you can look into using Amazon Elastic File System (EFS), which you can mount on your EC2 instance and then perform the operations as required.
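As a sketch, mounting an EFS file system on an EC2 instance looks roughly like this (the file system ID, region, and mount point are placeholders, and this assumes the NFS client or amazon-efs-utils is installed on the instance):

$ sudo mkdir -p /mnt/efs
$ sudo mount -t nfs4 -o nfsvers=4.1 fs-0example1234.efs.us-east-1.amazonaws.com:/ /mnt/efs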