Nextflow script with both 'local' and 'awsbatch' executors - aws-batch

I have a Nextflow pipeline that runs on AWS Batch. Recently, I tried to add a process that uploads files from my local machine to an S3 bucket so I don't have to upload them manually before each run. I wrote a Python script that handles the upload and wrapped it in a Nextflow process. Since I am uploading from the local machine, I want the upload process to run with
executor 'local'
This requires the Fusion filesystem to be enabled in order to have a work directory in S3. But when I enable the Fusion filesystem I don't have access to my local filesystem. My understanding is that when the Fusion filesystem is enabled, the task runs in a Wave container without access to the host filesystem. Does anyone have experience running Nextflow with Fusion enabled, and how do you access the host filesystem? Thanks!

I don't think you need to manage a hybrid workload here. Pipeline inputs can be stored either locally or in an S3 bucket. If your files are stored locally and you specify a working directory in S3, Nextflow will already try to upload them into the staging area for you. For example, if you specify your working directory in S3 using -work-dir 's3://mybucket/work', Nextflow will try to stage the input files under s3://mybucket/work/stage-<session-uuid>. Once the files are in the staging area, Nextflow can then begin to submit jobs that require them.
Note that a Fusion file system is not strictly required to have your working directory in S3. Nextflow includes support for S3. Either include your AWS access and secret keys in your pipeline configuration or use an IAM role to allow your EC2 instances full access to S3 storage.
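For illustration, a minimal nextflow.config sketch along these lines (the bucket, Batch queue and region are placeholders; the access/secret keys can be dropped entirely if your instances use an IAM role):
workDir = 's3://mybucket/work'

aws {
    region    = 'eu-west-1'
    accessKey = '<your access key>'   // omit when relying on an IAM role
    secretKey = '<your secret key>'   // omit when relying on an IAM role
}

process {
    executor = 'awsbatch'
    queue    = 'my-batch-queue'
}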

Related

On-premise file backup to AWS

Use case:
I have one directory on-premise and I want to back it up, let's say every midnight, and be able to restore it if something goes wrong.
It doesn't seem like a complicated task, but reading through the AWS documentation even this can be cumbersome and costly. Setting up a Storage Gateway locally seems unnecessarily complex for a simple task like this, and setting one up on EC2 is also costly.
What I have done:
Reading through this + some other blog posts:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
https://docs.aws.amazon.com/storagegateway/latest/userguide/WhatIsStorageGateway.html
What I have found:
1. Setting up a file gateway (locally or as an EC2 instance):
It just mounts the files to an S3 bucket, and that's it. So my on-premise app would constantly write to this S3 bucket. The documentation doesn't mention anything about scheduled backup and recovery.
2. Setting up a volume gateway:
Here I can make a scheduled synchronization/backup to S3, but using a whole volume for it would be a big overhead.
3. Standalone S3:
Just using a bare S3 bucket and copying my backups there via the AWS API/SDK with a manually created scheduled job.
Solutions:
Using point 1 from above, enable versioning and the versions of the files will serve as recovery points.
Using point 3.
I think I am looking for a mix of the file and volume gateways: working at the file level while making asynchronous scheduled snapshots of the files.
How should this be handled? Isn't there a really easy way to just send a backup of a directory to AWS?
The easiest way to back up a directory to Amazon S3 would be:
Install the AWS Command-Line Interface (CLI)
Provide credentials via the aws configure command
When required, run the aws s3 sync command
For example:
aws s3 sync folder1 s3://bucketname/folder1/
This will copy any files from the source to the destination. It will only copy files that have been added or changed since a previous sync.
Documentation: sync — AWS CLI Command Reference
If you want to be fancier and keep multiple backups, you could copy to a different target directory each time, create a zip file first and upload that, or even use a backup program like CloudBerry Backup that knows how to use S3 and can do traditional-style backups.
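If you do want the midnight schedule from the question, a plain cron entry on the on-premise machine is enough. A hedged sketch (the local path and bucket are placeholders; note that % must be escaped inside a crontab):
# crontab entry: nightly sync at midnight into a date-stamped prefix
0 0 * * * aws s3 sync /data/myfolder s3://mybucket/backups/$(date +\%F)/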

What is better: mounting an S3 bucket or copying files from an S3 bucket to a Windows EC2 instance?

I have a use case where CSV files are stored in an S3 bucket by a service. My program running on a Windows EC2 instance has to use the CSV files dumped into that bucket. Mounting or copying, which approach is better for using the files, and how should I approach it?
Mounting the bucket as a local Windows drive will just cache information about the bucket and copy the files locally when you try to access them. Either way you will end up having the files copied to the Windows machine. If you don't want to build knowledge of the S3 bucket into your application, then a mounting tool can be an attractive solution, but in my experience it can be very buggy. I built a system on Windows machines in the past that used an S3 bucket mounting product, but after so many bugs and failures I ended up rewriting it to simply perform an aws s3 sync operation to a local folder before the process ran.
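For example, a sketch of that pre-run sync (the bucket prefix and local path are placeholders); the same command works from PowerShell or a Windows scheduled task:
aws s3 sync s3://mybucket/csv-dumps/ C:\data\csv-dumps\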
I always suggest copying, whether via the CLI, the REST endpoints, an SDK, or whatever way AWS suggests, rather than mounting.
S3 is not built to be a filesystem; it's an object storage system. That's not to say you cannot do it, but it is not advisable. The correct way to use Amazon S3 is to put/get files using the S3 APIs.
If you are concerned about network latency, I would say both approaches will be about the same. And if you are thinking about directly modifying/editing a file within the mounted filesystem: no, you cannot. Since Amazon S3 is designed for atomic operations, objects have to be completely replaced with their modified versions.

Adding an S3 bucket as a Docker volume

I have a Spring Boot application, running in our internal data center, which processes files from a specific folder on the host.
We want to deploy this to AWS and use an S3 bucket to upload files for processing.
Is there any way we can add S3 bucket space as a Docker volume?
UPD: See the update at the bottom of this answer.
Other answers mistakenly say that AWS S3 is an object store that cannot be mounted as a volume in Docker. That is not correct: AWS S3 has third-party FUSE drivers, which allow it to be mounted as a local filesystem and operated on as if objects were files.
However, it does not seem that this FUSE driver has been made available as a Docker storage plugin just yet.
Edit: well, I have to correct myself just a couple of minutes after posting this. There is in fact a FUSE-based driver for Docker that can mount a volume from AWS S3. See REX-Ray, and also here for a possible configuration issue.
Other answers have correctly pointed out that:
AWS S3 is an object store and you cannot mount it as a volume in Docker.
That being said, using S3 from a Spring application is super easy thanks to a framework called spring-cloud, and spring-cloud works excellently with AWS.
Here is sample code:
public void uploadFiles(File file, String s3Url) throws IOException {
    // resourceLoader is an injected org.springframework.core.io.ResourceLoader;
    // with spring-cloud-aws on the classpath it can resolve s3:// URLs
    WritableResource resource = (WritableResource) resourceLoader.getResource(s3Url);
    try (OutputStream outputStream = resource.getOutputStream()) {
        Files.copy(file.toPath(), outputStream);
    }
}
You can find a detailed blog post over here.
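For the question's actual direction (reading files dropped into the bucket for processing), the same resourceLoader can be used the other way around. A minimal sketch, assuming the same injected resourceLoader as above and a placeholder s3:// URL (uses org.springframework.core.io.Resource, java.io.InputStream, java.nio.file.Files and java.nio.file.StandardCopyOption):
public void downloadFile(String s3Url, Path target) throws IOException {
    // s3Url e.g. "s3://my-bucket/incoming/input.csv" (placeholder)
    Resource resource = resourceLoader.getResource(s3Url);
    try (InputStream in = resource.getInputStream()) {
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
}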
S3 is an object store, not a file system. You should have S3 trigger a message to SQS when new objects are added to the bucket. Then you can code your application running in the Docker container to poll SQS for new messages, and use the S3 location in the message to copy the object from S3 to local storage (using the appropriate AWS SDK) for processing.
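A minimal sketch of that polling loop using the AWS SDK for Java v2 and Jackson (the queue URL, the download directory, and handling only the first record of each notification are simplifying assumptions; real S3 event notifications can carry several records and URL-encoded keys):
import java.nio.file.Path;
import java.nio.file.Paths;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class S3EventPoller {

    public static void main(String[] args) throws Exception {
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/csv-events"; // placeholder
        ObjectMapper mapper = new ObjectMapper();

        try (SqsClient sqs = SqsClient.create(); S3Client s3 = S3Client.create()) {
            while (true) {
                // long-poll for up to 20 seconds so we don't hammer the queue
                ReceiveMessageRequest request = ReceiveMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .waitTimeSeconds(20)
                        .maxNumberOfMessages(10)
                        .build();

                for (Message message : sqs.receiveMessage(request).messages()) {
                    // the S3 event notification is JSON; skip messages without records
                    // (e.g. the s3:TestEvent sent when notifications are first configured)
                    JsonNode records = mapper.readTree(message.body()).path("Records");
                    if (records.size() == 0) {
                        continue;
                    }
                    JsonNode record = records.get(0);
                    String bucket = record.path("s3").path("bucket").path("name").asText();
                    String key = record.path("s3").path("object").path("key").asText();

                    // copy the object to local storage for processing
                    // (this overload requires that the target file does not already exist)
                    Path target = Paths.get("/data/incoming", Paths.get(key).getFileName().toString());
                    s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build(), target);

                    // remove the message once the object has been fetched
                    sqs.deleteMessage(DeleteMessageRequest.builder()
                            .queueUrl(queueUrl)
                            .receiptHandle(message.receiptHandle())
                            .build());
                }
            }
        }
    }
}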
No. Docker volumes are for mounting drives on the machine (https://docs.docker.com/storage/volumes/).
You can use the S3 API to manage your bucket from the Docker container (https://docs.aws.amazon.com/AmazonS3/latest/API/Welcome.html).

Upload a file to S3 using Chef

I am automating the installation of a distributed software package using Chef. It has multiple nodes, and on the master node it creates an encryption key in a file. That file should be present on all the slave nodes. I was planning to publish this file to S3 from the master node and download it on the slave nodes. I know we can use the s3_file cookbook to download a file from S3, but I don't know how to upload a file to S3 in Chef. So I'm looking for suggestions on how to upload to S3, or another workaround if S3 uploading is not available. Thanks!
There is nothing specific in Chef for this, and I would highly recommend not handling this in Chef as then you have to deal with all kinds of race conditions when multiple nodes are booting simultaneously :) Probably just create and upload the key manually to your secrets management system (S3 buckets with ACLs are a simple option, but there is also SSM Parameter Store, or the newer AWS Secrets Manager), and then just deal with the download end from Chef.
You can use the AWS CLI to upload a file to S3. If it is Windows, use PowerShell to run it; otherwise use a shell script. First install the AWS CLI, then use the aws s3 cp command to upload the local file.
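For example, a hedged sketch of the upload step on the master node (the file path, bucket and key are placeholders); the slave nodes can then fetch the object with s3_file as you mentioned:
aws s3 cp /etc/myapp/encryption.key s3://mybucket/keys/encryption.key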

AWS Elastic Beanstalk save file to S3 or application directory

I have an application deployed on EB that needs to download a file from a remote server and then serve it to visitors.
As I understand it, it is recommended to save files to S3 instead and then grant users access to those files. However, I believe there is no way for S3 to initiate the download of a file from a remote server, so the process would be:
EB application gets the files => EB application uploads the files to S3.
That would double the wait time for users.
Should I save files directly to the application directory instead, since I will only use 200-300 MB max and clean it daily?
Is there any risk, or a better approach to this problem?
Why would it double the time? The upload to S3 would be extremely quick. You could even stream the file to S3 as it is being downloaded.
Saving the files to the server will prevent you from scaling your application beyond a single server.
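A minimal sketch of that streaming idea with the AWS SDK for Java v2 (the remote URL, bucket and key are placeholders, and it assumes the remote server reports a Content-Length header, since this putObject overload needs the length up front):
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class RelayToS3 {
    public static void main(String[] args) throws Exception {
        URLConnection connection = new URL("https://example.com/big-file.bin").openConnection(); // placeholder
        long contentLength = connection.getContentLengthLong(); // must be known for this overload

        try (InputStream in = connection.getInputStream(); S3Client s3 = S3Client.create()) {
            // relay the remote file straight into S3 without writing it to local disk
            s3.putObject(
                    PutObjectRequest.builder().bucket("my-bucket").key("downloads/big-file.bin").build(),
                    RequestBody.fromInputStream(in, contentLength));
        }
    }
}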