AWS S3 mount on Docker in AWS Batch

I want to mount an AWS s3 bucket on my Docker container which I am using to run some AWS Batch jobs.
I have been researching several ways of going about this, but I still lack clarity as to how I can get this to work on AWS Batch, which is going to dynamically allocate EC2 instances based on the job definitions.
The following are the ideas I have gathered, but I am unsure of how to put them together:
https://rexray.readthedocs.io/en/v0.9.0/user-guide/docker-plugins/
I could use this plugin to mount an S3 bucket as a Docker volume, but I am unsure how to do this on AWS Batch. Should this plugin be a part of the Docker image?
I could use s3fs-fuse, but I was told that I won't be able to install or store any of the files from S3 on the EC2 instances that AWS Batch launches, which could then be mounted in Docker. Is there a way to do this by including some code in the AMI that will copy files from S3 to the instance?
Are there any other ways I can get this to work?
Pardon me if my questions are too basic. I am fairly new to Docker and AWS Batch. Would appreciate any help!
Thanks!

I have personally used s3fs to solve this problem in the past. Using S3 as a mounted filesystem has some caveats which you would be wise to familiarize yourself with (because you are treating something that is not a filesystem like a filesystem, a classic leaky abstraction problem), but if your workflow is relatively simple and does not have the possibility of race conditions, you should be able to do it with some confidence (especially now that, as of Dec 2020, AWS S3 provides strong read-after-write consistency automatically for all applications).
To answer your other question:
I could use s3fs-fuse, but I was told that I won't be able to install or store any of the files from S3 on the EC2 instances that AWS Batch launches, which could then be mounted in Docker. Is there a way to do this by including some code in the AMI that will copy files from S3 to the instance?
If you use s3fs to mount your S3 bucket as a filesystem within Docker, you don't need to worry about copying files from S3 to the instance; indeed, the whole point of using s3fs is that you can access all your files in S3 from the container without having to move them off of S3.
Say for instance you mount your S3 bucket s3://my-test-bucket to /data in the container. You can then run your program like my-executable --input /data/my-s3-file --output /data/my-s3-output as if the input file were right there on the local filesystem. When it's done, you can see the output file on S3 in s3://my-test-bucket/my-s3-output. This can simplify your workflow and cut down on glue code quite a bit.
My dockerfile for my s3fs AWS batch container looks like this:
# Build s3fs from source on top of Ubuntu 18.04
FROM ubuntu:18.04
RUN apt-get -y update && apt-get -y install curl wget build-essential automake libcurl4-openssl-dev libxml2-dev pkg-config libssl-dev libfuse-dev parallel
RUN wget https://github.com/s3fs-fuse/s3fs-fuse/archive/v1.86.tar.gz && \
    tar -xzvf v1.86.tar.gz && \
    cd s3fs-fuse-1.86 && \
    ./autogen.sh && \
    ./configure --prefix=/usr && \
    make && \
    make install && \
    cd .. && \
    rm -rf s3fs-fuse-1.86 v1.86.tar.gz
# Mount point for the bucket
RUN mkdir /data
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
entrypoint.sh is a convenience for always running the s3fs mount before the main program (this breaks the paradigm of one process per Docker container, but I don't think it's cause for major concern here). It looks like this:
#!/bin/bash
bucket=my-bucket
# Mount the bucket using the ECS/Batch task role credentials
s3fs ${bucket} /data -o ecs
echo "Mounted ${bucket} to /data"
# Hand off to whatever command the job definition specified
exec "$@"
Note related answer here: https://stackoverflow.com/a/60556131/1583239
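One AWS Batch detail worth calling out: a FUSE mount inside the container generally needs the job to run with elevated privileges, so the job definition has to request them and should also carry a job role that can reach the bucket. A rough, untested sketch of registering such a job definition with the CLI, with placeholder names, image, and ARNs, might look like this:
aws batch register-job-definition \
  --job-definition-name s3fs-batch-job \
  --type container \
  --container-properties '{
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-s3fs-image:latest",
    "vcpus": 1,
    "memory": 2048,
    "privileged": true,
    "jobRoleArn": "arn:aws:iam::123456789012:role/my-batch-s3-role"
  }'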

I am assuming you want to read/write to an S3 bucket. You can do this within your containerized code by using a library like boto3. You will also need to provide IAM permissions for the container to access S3.


How can I install aws cli, from WITHIN the ECS task?

Question:
How can I install aws cli, from WITHIN the ECS task?
DESCRIPTION:
I'm using a docker container to run the logstash application (it is part of the elastic family).
The docker image name is "docker.elastic.co/logstash/logstash:7.10.2"
This logstash application needs to write to S3, thus it needs AWS CLI installed.
If aws is not installed, it crashes.
# STEP 1 #
To avoid crashing, when I used this application only as a Docker container, I ran it in a way that delayed the logstash start until after the container had started.
I did this by adding a "sleep" command to an external docker-entrypoint file, before it starts logstash.
This is how it looks in the docker-entrypoint file:
sleep 120
if [[ -z $1 ]] || [[ ${1:0:1} == '-' ]] ; then
    exec logstash "$@"
else
    exec "$@"
fi
# EOF
# STEP 2 #
Run the container with the "--entrypoint" flag so it will use my entrypoint file:
docker run \
-d \
--name my_logstash \
-v /home/centos/DevOps/psifas_logstash_docker-entrypoint:/usr/local/bin/psifas_logstash_docker-entrypoint \
-v /home/centos/DevOps/logstash.conf:/usr/share/logstash/pipeline/logstash.conf \
-v /home/centos/DevOps/logstash.yml:/usr/share/logstash/config/logstash.yml \
--entrypoint /usr/local/bin/psifas_logstash_docker-entrypoint \
docker.elastic.co/logstash/logstash:7.10.2
# STEP 3 #
Install and configure the AWS CLI from the server hosting the container:
docker exec -it -u root <DOCKER_CONTAINER_ID> yum install awscli -y
docker exec -it <DOCKER_CONTAINER_ID> aws configure set aws_access_key_id <MY_aws_access_key_id>
docker exec -it <DOCKER_CONTAINER_ID> aws configure set aws_secret_access_key <MY_aws_secret_access_key>
docker exec -it <DOCKER_CONTAINER_ID> aws configure set region <MY_region>
This worked for me.
Now I want to "translate" this flow into an AWS ECS task.
In ECS I will use parameters instead of running the above 3 "aws configure" commands.
MY QUESTION
How can I do my 3rd step, installing the aws cli, from WITHIN the ECS task? (meaning not running it on the EC2 server hosting the ECS cluster)
When I was working on the docker I also thought of these options to use the aws cli:
Find an official Elastic Docker image containing both logstash and the aws cli. <-- I did not find one.
Create such an image by myself and use it. <-- I prefer not to, because I want to avoid the maintenance of creating new custom images when needed (e.g. when a new version of the logstash image is available).
Eventually I chose the 3 steps above, but I'm open to suggestions.
Also, my tests showed that running 2 containers within the same ECS task:
logstash
awscli (image "amazon/aws-cli")
with the logstash container then using the aws cli container, is not working.
THANKS A LOT IN ADVANCE :-)
Your option #2, create the image yourself, is really the best way to do this. Anything else is going to be a "hack". Also, you shouldn't be running aws configure for an image running in ECS; you should be assigning an IAM role to the task, and the AWS CLI will pick that up and use it.
Mark B, your answer helped me to solve this. Thanks!
writing here the solution in case it will help somebody else.
There is no need to install the AWS CLI in the logstash docker container running inside the ECS task.
Inside the logstash container (from image "docker.elastic.co/logstash/logstash:7.10.2") there is an AWS SDK that connects to S3.
The only thing required is to allow the ECS task execution role access to S3.
(I attached the AmazonS3FullAccess policy)
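For anyone wiring this up from the CLI, the role-based setup described above might look roughly like the following. The role name, task definition family, and account ID are made up, and in most setups the S3 permissions belong on the task role (the role the application code assumes) rather than the execution role:
# Give the task's role access to S3 (a scoped-down policy is preferable to FullAccess)
aws iam attach-role-policy \
  --role-name my-logstash-task-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Reference the role in the task definition so the SDK inside the container picks it up
aws ecs register-task-definition \
  --family logstash \
  --task-role-arn arn:aws:iam::123456789012:role/my-logstash-task-role \
  --container-definitions '[{"name":"logstash","image":"docker.elastic.co/logstash/logstash:7.10.2","memory":2048,"essential":true}]'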

fuse: device not found, try 'modprobe fuse' first while mounting s3 bucket on a docker container

I have to mount an S3 bucket on a Docker container so that we can store its contents in the S3 bucket.
I found https://www.youtube.com/watch?v=FFTxUlW8_QQ&ab_channel=ValaxyTechnologies video which shows how to do the same process for ec2 instance instead of a docker container.
I am following the same steps as mentioned in the above link. Likewise, I have done the following things on the docker container:
(Install FUSE Packages)
apt-get install build-essential gcc libfuse-dev libcurl4-openssl-dev libxml2-dev mime-support pkg-config libxml++2.6-dev libssl-dev
git clone https://github.com/s3fs-fuse/s3fs-fuse.git
cd s3fs-fuse
./autogen.sh
./configure
make
make install
(Ensure you have an IAM Role with Full Access to S3)
(Create the Mountpoint)
mkdir -p /var/s3fs-demo-fs
(Target Bucket)
aws s3 mb s3://s3fs-demo-bkt
But when I try to mount the s3 bucket using
s3fs s3fs-demo-bkt /var/s3fs-demo-fs -o iam_role=
I am getting the following message:
fuse: device not found, try 'modprobe fuse' first
I have looked over several solutions for this problem. But I am not able to resolve this issue. Please let me know how I can solve it.
I encountered the same problem, but the issue was later fixed by adding --privileged when running the docker run command.
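For reference, the flag goes on the docker run invocation. A narrower alternative that is usually sufficient for FUSE is to grant only the SYS_ADMIN capability plus the fuse device; the image name below is a placeholder:
# Full privileges, as in the answer above
docker run --privileged -it my-s3fs-image

# Narrower: only what FUSE typically needs
docker run --cap-add SYS_ADMIN --device /dev/fuse -it my-s3fs-image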

How to increase the maximum size of the AWS lambda deployment package (RequestEntityTooLargeException)?

I upload my lambda function sources from AWS CodeBuild. My Python script uses NLTK so it needs a lot of data. My .zip package is too big and a RequestEntityTooLargeException occurs. I want to know how to increase the size of the deployment package sent via the UpdateFunctionCode command.
I use AWS CodeBuild to transform the source from a GitHub repository to AWS Lambda. Here is the associated buildspec file:
version: 0.2
phases:
  install:
    commands:
      - echo "install step"
      - apt-get update
      - apt-get install zip -y
      - apt-get install python3-pip -y
      - pip install --upgrade pip
      - pip install --upgrade awscli
      # Define directories
      - export HOME_DIR=`pwd`
      - export NLTK_DATA=$HOME_DIR/nltk_data
  pre_build:
    commands:
      - echo "pre_build step"
      - cd $HOME_DIR
      - virtualenv venv
      - . venv/bin/activate
      # Install modules
      - pip install -U requests
      # NLTK download
      - pip install -U nltk
      - python -m nltk.downloader -d $NLTK_DATA wordnet stopwords punkt
      - pip freeze > requirements.txt
  build:
    commands:
      - echo 'build step'
      - cd $HOME_DIR
      - mv $VIRTUAL_ENV/lib/python3.6/site-packages/* .
      - sudo zip -r9 algo.zip .
      - aws s3 cp --recursive --acl public-read ./ s3://hilightalgo/
      - aws lambda update-function-code --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight --zip-file fileb://algo.zip
      - aws lambda update-function-configuration --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight --environment 'Variables={NLTK_DATA=/var/task/nltk_data}'
  post_build:
    commands:
      - echo "post_build step"
When I launch the pipeline, I get a RequestEntityTooLargeException because there is too much data in my .zip package. See the build logs below:
[Container] 2019/02/11 10:48:35 Running command aws lambda update-function-code --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight --zip-file fileb://algo.zip
An error occurred (RequestEntityTooLargeException) when calling the UpdateFunctionCode operation: Request must be smaller than 69905067 bytes for the UpdateFunctionCode operation
[Container] 2019/02/11 10:48:37 Command did not exit successfully aws lambda update-function-code --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight --zip-file fileb://algo.zip exit status 255
[Container] 2019/02/11 10:48:37 Phase complete: BUILD Success: false
[Container] 2019/02/11 10:48:37 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: aws lambda update-function-code --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight --zip-file fileb://algo.zip. Reason: exit status 255
Everything works correctly when I reduce the NLTK data to download (I tried with only the packages stopwords and wordnet).
Does anyone have an idea to solve this "size limit problem"?
You cannot increase the deployment package size for Lambda. AWS Lambda limits are described in the AWS Lambda developer guide. More information on how those limits work can be seen here. In essence, your unzipped package size has to be less than 250MB (262144000 bytes).
PS: Using layers doesn't solve the sizing problem, though it helps with management and maybe faster cold starts. The package size limit includes the layers - see Lambda layers.
A function can use up to 5 layers at a time. The total unzipped size of the function and all layers can't exceed the unzipped deployment package size limit of 250 MB.
Update Dec 2020: As per the AWS blog, and as pointed out by user jonnocraig in this answer, you can overcome these restrictions if you build a container image for your application and run it on Lambda.
If anyone stumbles across this issue post December 2020, there's been a major update from AWS to support Lambda functions as container images (up to 10GB!!). More info here
AWS Lambda functions can mount EFS. You can load libraries or packages that are larger than the 250 MB package deployment size limit of AWS Lambda using EFS.
Detailed steps on how to set it up are here:
https://aws.amazon.com/blogs/aws/new-a-shared-file-system-for-your-lambda-functions/
On a high level, the changes include:
Create and setup EFS file system
Use EFS with lambda function
Install the pip dependencies inside the EFS access point
Set the PYTHONPATH environment variable to tell Python where to look for the dependencies (a rough CLI sketch of these last steps follows the list)
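As a rough, untested sketch of those last two steps with the CLI (the function name, access point ARN, and paths are placeholders, and the function must be attached to a VPC that can reach the file system):
# Attach an existing EFS access point to the function and point Python at it
aws lambda update-function-configuration \
  --function-name my-function \
  --file-system-configs Arn=arn:aws:elasticfilesystem:eu-west-3:123456789012:access-point/fsap-0123456789abcdef0,LocalMountPath=/mnt/efs \
  --environment 'Variables={PYTHONPATH=/mnt/efs/python-packages}'

# From a machine that has the same file system mounted at /mnt/efs,
# install the heavy dependencies onto it
pip install --target /mnt/efs/python-packages nltk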
The following are hard limits for Lambda (may change in future):
3 MB for in-console editing
50 MB zipped as package for upload
250 MB when unzipped including layers
A sensible way to get around this is to mount EFS from your Lambda. This can be useful not only for loading libraries, but also for other storage.
Have a look through these blogs:
https://aws.amazon.com/blogs/compute/using-amazon-efs-for-aws-lambda-in-your-serverless-applications/
https://aws.amazon.com/blogs/aws/new-a-shared-file-system-for-your-lambda-functions/
I have not tried this myself, but the folks at Zappa describe a trick that might help. Quoting from https://blog.zappa.io/posts/slim-handler:
Zappa zips up the large application and sends the project zip file up to S3. Second, Zappa creates a very minimal slim handler that just contains Zappa and its dependencies and sends that to Lambda.
When the slim handler is called on a cold start, it downloads the large project zip from S3 and unzips it in Lambda’s shared /tmp space. All subsequent calls to that warm Lambda share the /tmp space and have access to the project files; so it is possible for the file to only download once if the Lambda stays warm.
This way you should get 500MB in /tmp.
Update:
I have used the following code in the lambdas of a couple of projects. It is based on the method Zappa used, but can be used directly.
# Based on the code in https://github.com/Miserlou/Zappa/blob/master/zappa/handler.py
# We need to load the layer from an S3 bucket into /tmp, bypassing the normal
# AWS layer mechanism, since it is too large; the AWS unzipped lambda function size
# including layers is 250MB.
import io
import os
import sys
import zipfile

import boto3


def load_remote_project_archive(remote_bucket, remote_file, layer_name):
    # Puts the project files from S3 in /tmp and adds them to the path
    project_folder = '/tmp/{0!s}'.format(layer_name)
    if not os.path.isdir(project_folder):
        # The project folder doesn't exist in this cold lambda, get it from S3
        boto_session = boto3.Session()
        # Download zip file from S3
        s3 = boto_session.resource('s3')
        archive_on_s3 = s3.Object(remote_bucket, remote_file).get()
        # Unzip from the stream
        with io.BytesIO(archive_on_s3["Body"].read()) as zf:
            # Rewind the file
            zf.seek(0)
            # Read the file as a zipfile and extract the members
            with zipfile.ZipFile(zf, mode='r') as zipf:
                zipf.extractall(project_folder)
    # Add to project path
    sys.path.insert(0, project_folder)
    return True
This can then be called as follows (I pass the bucket with the layer to the lambda function via an env variable):
load_remote_project_archive(os.environ['MY_ADDITIONAL_LAYERS_BUCKET'], 'lambda_my_extra_layer.zip', 'lambda_my_extra_layer')
At the time I wrote this code, /tmp was also capped, I think to 250MB. The call to zipf.extractall(project_folder) above can be replaced with extracting directly to memory: unzipped_in_memory = {name: zipf.read(name) for name in zipf.namelist()},
which I did for some machine learning models. I guess the answer of @rahul is more versatile for this, though.
From the AWS documentation:
If your deployment package is larger than 50 MB, we recommend
uploading your function code and dependencies to an Amazon S3 bucket.
You can create a deployment package and upload the .zip file to your
Amazon S3 bucket in the AWS Region where you want to create a Lambda
function. When you create your Lambda function, specify the S3 bucket
name and object key name on the Lambda console, or using the AWS
Command Line Interface (AWS CLI).
You can use the AWS CLI to deploy the package, and instead of using the --zip-file argument to pass the deployment package, you can specify the object in the S3 bucket with the --code parameter. Ex:
aws lambda create-function --function-name my_function --code S3Bucket=my_bucket,S3Key=my_file
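The same pattern should work for the update-function-code call from the question's buildspec, since that command also accepts an S3 location instead of a local zip (the bucket and key below are taken from the buildspec; adjust as needed). Note that the 250 MB unzipped limit still applies either way.
aws lambda update-function-code \
  --function-name arn:aws:lambda:eu-west-3:671560023774:function:LaunchHilight \
  --s3-bucket hilightalgo \
  --s3-key algo.zip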
This aws wrangler zip file from github (https://github.com/awslabs/aws-data-wrangler/releases) includes many other libraries like pandas and pymysql. In my case it was the only layer I needed since it has so much other stuff. Might work for some people.
You can try the workaround used in the awesome serverless-python-requirements plugin.
The ideal solution is to use Lambda layers if that serves the purpose. If the total dependency size is greater than 250MB, then you can sideload less frequently used dependencies from an S3 bucket at run time by utilizing the 512 MB provided in the /tmp directory. The zipped dependencies are stored in S3, and the Lambda can fetch the files from S3 during initialisation. Unzip the dependency package and add the path to sys.path.
Please note that the Python dependencies need to be built on Amazon Linux, which is the operating system for Lambda containers. I used an EC2 instance to create the zip package.
You can check the code used in serverless-python-requirements here
Before 2021, the best way was to deploy the jar file to S3 and create the AWS Lambda function with it.
From 2021, AWS Lambda began to support container images. Read here: https://aws.amazon.com/de/blogs/aws/new-for-aws-lambda-container-image-support/
So from now on, you should probably consider packaging and deploying your Lambda functions as container images (up to 10 GB).
The tip for using a large Lambda project on AWS is to use a Docker image stored in the AWS ECR service instead of a ZIP file. You can use a Docker image of up to 10 GB.
The AWS documentation provides an example to help you here:
Create an image from an AWS base image for Lambda
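As a rough sketch of that flow with the CLI (repository name, account ID, region, and role ARN are placeholders, and the Dockerfile is assumed to follow the linked AWS base-image example):
# Create a repository and log Docker in to ECR
aws ecr create-repository --repository-name my-lambda-image
aws ecr get-login-password --region eu-west-3 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.eu-west-3.amazonaws.com

# Build and push the image
docker build -t 123456789012.dkr.ecr.eu-west-3.amazonaws.com/my-lambda-image:latest .
docker push 123456789012.dkr.ecr.eu-west-3.amazonaws.com/my-lambda-image:latest

# Create the function from the image
aws lambda create-function \
  --function-name my-function \
  --package-type Image \
  --code ImageUri=123456789012.dkr.ecr.eu-west-3.amazonaws.com/my-lambda-image:latest \
  --role arn:aws:iam::123456789012:role/my-lambda-exec-role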
May be late to the party but you can use a Docker Image to get around the lambda layer constraint. This can be done using serverless stack development or just through the console.
You cannot increase the package size, but you can use AWS Lambda layers to store some application dependencies.
https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html#configuration-layers-path
Before layers existed, a commonly used pattern to work around this limitation was to download huge dependencies from S3.
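If layers do fit your case, publishing and attaching one from the CLI might look like this (names and ARN are placeholders; for Python, dependencies are expected under a python/ directory inside the zip, and the 250 MB unzipped cap still includes all attached layers):
# Zip the dependencies laid out under python/ and publish them as a layer
zip -r my-deps-layer.zip python/
aws lambda publish-layer-version \
  --layer-name my-deps \
  --zip-file fileb://my-deps-layer.zip

# Attach the layer to the function
aws lambda update-function-configuration \
  --function-name my-function \
  --layers arn:aws:lambda:eu-west-3:123456789012:layer:my-deps:1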

AWS S3FS How to

Here's the current scenario -
I have multiple S3 buckets, which have SQS events configured for PUTs of objects from an FTP, which I have configured using S3FS.
Also, I have multiple directories on an EC2 instance, into which a user can PUT an object, which gets synced with the different S3 buckets (using S3FS), which generate SQS events (using S3's SQS events).
Here's what I need to achieve:
Instead of multiple S3 buckets, I need to consolidate the logic at folder level,
i.e. I have now created different folders for each bucket that I had created previously, and I have created separate SQS events for PUTs in the individual folders.
Now I want to tweak the bucket-level logic of S3FS to work at folder level in a single S3 bucket,
i.e. I want to create 3 different directories on the EC2 instance, e.g. A, B, C.
If I PUT an object in directory A on the EC2 instance, the object must get synced with folder A in the S3 bucket,
and similarly for directory B and folder B of S3, and directory C on EC2 and folder C on S3.
Here are the steps I created for installing S3FS -
Steps -
ssh into the EC2
sudo apt-get install automake autotools-dev g++ git libcurl4-gnutls-dev libfuse-dev libssl-dev libxml2-dev make pkg-config
git clone https://github.com/s3fs-fuse/s3fs-fuse.git
cd s3fs-fuse
./autogen.sh
./configure
make
sudo make install
Mounting S3 Bucket to File System
echo access-key-id:secret-access-key > /etc/passwd-s3fs
chmod 600 /etc/passwd-s3fs
mkdir /mnt/bucketname
echo s3fs#bucketname /mnt/bucketname fuse _netdev,rw,nosuid,nodev,allow_other 0 0 >> /etc/fstab
mount -a
Now these steps achieve sync between a particular directory on the EC2 instance and the S3 bucket.
How do I tweak this to sync, say, 2 different directories on the EC2 instance with 2 different folders on S3?
I am a Linux and AWS newbie, please help me out.
Do not mount the S3 bucket to the file system. Use AWS S3 CLI and Cron to Sync the EC2 Directory with the S3 Bucket Directory.
Install S3CMD on the EC2 instance (http://tecadmin.net/install-s3cmd-manage-amazon-s3-buckets/#)
Start a cron job for achieving the Sync with the local directory and the S3 Bucket Subfolder.
Create a script file, for example "script.sh":
#!/bin/bash
aws s3 sync /path/to/folder/A s3://mybucket/FolderA
aws s3 sync /path/to/folder/B s3://mybucket/FolderB
aws s3 sync /path/to/folder/C s3://mybucket/FolderC
Start a cron job with something like this:
* * * * * /root/scripts/script.sh
And you will achieve your use case.
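As a side note, if you would rather stay with the s3fs approach from the question, s3fs accepts a bucket sub-path in its bucket argument (bucket:/path), so a per-folder mount along these lines should also be possible. This is an untested sketch that reuses the /etc/passwd-s3fs credentials set up in the steps above:
# Mount each folder of the bucket on its own directory
mkdir -p /mnt/A /mnt/B /mnt/C
s3fs mybucket:/FolderA /mnt/A -o allow_other
s3fs mybucket:/FolderB /mnt/B -o allow_other
s3fs mybucket:/FolderC /mnt/C -o allow_other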

User Data script not downloading files from S3 bucket?

I have an AMI preconfigured with s3cmd and the EC2 API tools. While creating a new instance with user data for downloading files from an S3 bucket, I face the following problems.
In the user data I have code for:
- creating a new directory on the new instance.
- downloading a file from the AWS S3 bucket.
The script is:
#! /bin/bash
cd /home
mkdir pravin
s3cmd get s3://bucket/usr.sh >> temp.log
But in the above script, mkdir pravin creates a new directory named pravin, but s3cmd get s3://bucket/usr.sh does not download the file from the AWS S3 bucket.
It also creates temp.log, but it is empty.
So how can I solve this problem?
An alternative solution would be to use an instance that has an IAM role assigned to it and the aws-cli, which would require that you have Python installed. All of this could be accomplished by inserting the following in the user-data field for your instance:
#!/bin/bash
apt-get update
apt-get -y install python-pip
apt-get -y install awscli
mkdir pravin
aws s3 cp s3://bucket/usr.sh temp.log --region {YOUR_BUCKET_REGION}
NOTE: The above is applicable for Ubuntu only.
And then for your instance's IAM role you would attach a policy like so:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::YourBucketName/*"
    }
  ]
}
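To attach that as an inline policy to the instance's role from the CLI, something along these lines should work (the role and policy names are placeholders, and policy.json holds the document above):
aws iam put-role-policy \
  --role-name my-ec2-s3-role \
  --policy-name s3-bucket-access \
  --policy-document file://policy.json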
I suspect that the user running the user data script lacks a .s3cfg file. You may need to find a way to indicate the location of the file when running this script.
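If you want to stay with s3cmd rather than switching to the AWS CLI, one way to do that from the user data script is to point it explicitly at the configured user's file with --config; the paths below are guesses for illustration:
#!/bin/bash
cd /home
mkdir pravin
# Use the .s3cfg created when "s3cmd --configure" was run as e.g. the ubuntu user
s3cmd --config=/home/ubuntu/.s3cfg get s3://bucket/usr.sh /home/pravin/usr.sh >> /home/temp.log 2>&1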