Limit upload bandwidth for S3 sync with Ansible

AWS provides a setting to limit upload bandwidth when copying files to S3 from EC2 instances. It can be configured with the AWS CLI config below.
aws configure set default.s3.max_bandwidth
Once this config is set, running an AWS CLI command to cp files to S3 respects the bandwidth limit.
But when I run the s3_sync Ansible module on the same EC2 instance, that limitation is not applied. Is there any workaround to apply the limitation to Ansible as well?

Not sure if this is possible, because botocore may not support it.
Mostly it is up to Amazon to fix their Python API.
For example, the Docker module works fine by sharing configuration between the CLI and the Python API.
Obviously I assumed you ran this command locally as the same user, because otherwise the AWS config you made would clearly not be used.
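One possible workaround (a sketch, not from the original answers): since the bandwidth cap is honoured by the AWS CLI itself, you could call aws s3 sync through Ansible's command or shell module instead of s3_sync, so that the CLI configuration applies. The rate, paths, and bucket name below are placeholder examples:
# cap S3 transfer bandwidth for the default profile (example rate)
aws configure set default.s3.max_bandwidth 1MB/s
# sync via the CLI, which honours max_bandwidth; run this from an Ansible
# command/shell task on the EC2 instance instead of using s3_sync
aws s3 sync /var/data/uploads s3://my-example-bucket/uploads/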

Related

Connect to AWS Storage Gateway with S3 backing from a different cloud VPS (say DigitalOcean)

My goal is to have an S3-backed Storage Gateway available as a disk on a DigitalOcean droplet. From watching videos about Storage Gateway (e.g. https://www.youtube.com/watch?v=QaCfOatTIDA&t=136s), it seems like this is possible.
But I am quite confused by the options available when I go about configuring AWS Storage Gateway. I don't understand which of these options I should choose.
Is there any guide available on how to make AWS cloud storage available to a VPS on a different cloud provider?
My DigitalOcean droplet runs Ubuntu 20.04, if that matters.
Usually you would use FUSE to "mount" S3 in your Linux OS. Most notably, this can easily be done using s3fs:
s3fs allows Linux and macOS to mount an S3 bucket via FUSE. s3fs preserves the native object format for files, allowing use of other tools like the AWS CLI.
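A minimal sketch of how that might look on an Ubuntu 20.04 droplet (bucket name, credentials, and mount point are placeholders):
# install s3fs from the Ubuntu repositories
sudo apt-get update && sudo apt-get install -y s3fs
# store the bucket credentials (format: ACCESS_KEY_ID:SECRET_ACCESS_KEY; values are placeholders)
echo 'AKIAEXAMPLE:examplesecretkey' > "$HOME/.passwd-s3fs"
chmod 600 "$HOME/.passwd-s3fs"
# mount the bucket so it appears as a directory on the droplet
mkdir -p "$HOME/s3"
s3fs my-example-bucket "$HOME/s3" -o passwd_file="$HOME/.passwd-s3fs"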

Do I have to install the AWS CLI on each server?

I have multiple standalone servers from which I want to upload/sync directories to object storage using the AWS CLI.
Do I have to install the AWS CLI on each server? Or is there a common console/platform provided with AWS object storage from which I can run the same command over something like SSH? How can I avoid installing the CLI on all the servers?
You have to install the AWS CLI on all the servers. Even if you write a script that SSHes out from a single server which has the AWS CLI installed, the command executes on the remote server and uses the binaries and configuration found there, not on the server where the script is running. It's better to use a configuration management tool like Ansible to speed up the process.
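A rough sketch of automating the install with a plain shell loop (host names are examples; the URL is the standard AWS CLI v2 bundle for x86_64 Linux; assumes key-based SSH and passwordless sudo on the targets):
# install AWS CLI v2 on each server over SSH
for host in server1 server2 server3; do
  ssh "$host" '
    curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o /tmp/awscliv2.zip &&
    unzip -q -o /tmp/awscliv2.zip -d /tmp &&
    sudo /tmp/aws/install
  '
done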

How to run a Glue script from a Glue Dev Endpoint

I have a Glue script (test.py), written, say, in an editor. I connected to the Glue dev endpoint and copied the script to the endpoint, or I can store it in an S3 bucket. Basically, a Glue dev endpoint is an EMR cluster, so how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I am more interested to know whether I can run it from the Glue dev endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly without getting an interactive shell with ssh (of course, after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
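For example (the job name and temp bucket below are illustrative placeholders, not values from the original answer):
[glue@ip-w-x-y-z ~]$ gluepython myscript.py --JOB_NAME test_job --TempDir s3://my-glue-temp-bucket/temp/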
For development/testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue dev endpoint URL, so that you have access to the Data Catalog, crawlers, etc., and also to the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a Job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well (a CLI sketch of this follows below).
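A rough sketch of that last step with the AWS CLI (job name, role, and script location are placeholders):
# create a Glue job that points at the ETL script uploaded to S3
aws glue create-job \
  --name my-etl-job \
  --role MyGlueServiceRole \
  --command '{"Name": "glueetl", "ScriptLocation": "s3://my-example-bucket/scripts/test.py"}'
# run it on demand (it can also be attached to a schedule via a Glue trigger)
aws glue start-job-run --job-name my-etl-job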
Please refer here and to "setting up Zeppelin on Windows" for any help with setting up the local environment. You can use the dev endpoint instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you set up the Zeppelin notebook, you can copy the script (test.py) into the notebook and run it from Zeppelin.
According to the AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Do you have any specific requirement to run the Glue script on an EMR instance? In my opinion, EMR gives more flexibility, since you can use any third-party Python libraries and run directly on an EMR Spark cluster.
Regards

AWS S3 as Docker volume?

I'm trying to set up a GitLab CI pipeline using Docker on AWS EC2. Everything worked as expected until I hit the storage cap on my EC2 instance (8GB). As I quickly learned, a pipeline can easily use up 1-2 GB of data; with four of them on the server, everything stops.
Granted, I could look into optimising Docker storage usage, e.g. by using Alpine images, but I do need a more permanent solution because 8GB would hardly suffice.
I have been trying to use an S3 bucket, mounted with s3fs, as a Docker volume to handle my data-hungry pipelines, but to no avail: Docker volumes use hardlinks, which are not supported by s3fs.
Is it possible to configure Docker to use symlinks instead? Alternatively, are there other packages which mount S3 as a "true" filesystem?
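For context, a minimal sketch of the bind-mount variant of this approach (bucket, mount point, and image names are placeholders); it keeps pipeline data off the 8GB root disk, though it still inherits s3fs limitations such as the missing hardlink support mentioned above:
# mount the bucket on the EC2 host with s3fs (see the s3fs answer earlier for credential setup)
sudo mkdir -p /mnt/s3-cache
sudo s3fs my-ci-bucket /mnt/s3-cache -o passwd_file=/etc/passwd-s3fs -o allow_other
# bind-mount the host path into the pipeline container instead of using a named Docker volume
docker run --rm -v /mnt/s3-cache:/data my-pipeline-image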

How to restore a Postgres dump to RDS?

I have a Postgres dump in an AWS S3 bucket. What is the most convenient way to restore it to an AWS RDS instance?
AFAIK, there is no native AWS way to manually push data from S3 anywhere else. The dump stored on S3 needs to be downloaded first and then restored.
You can use the link posted above (http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Procedural.Importing.html); however, that doesn't help you download the data.
The easiest way to get something off of S3 is to go to the S3 console, point and click your way to the file, right-click it, and click Download. If you need to restore FROM an EC2 instance (e.g. because your RDS instance does not have a public IP), then install and configure the AWS CLI (http://docs.aws.amazon.com/cli/latest/userguide/installing.html).
Once you have the CLI configured, download with the following command:
aws s3 cp s3://<<bucket>>/<<folder>>/<<folder>>/<<key>> dump.gz
NOTE: the above command may need some additional tweaking depending on whether you have multiple AWS profiles configured on the machine, whether the dump is split into many files rather than one, etc.
From there, restore to RDS just as you would to a normal Postgres server, following the instructions in the AWS link.
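For example, a rough sketch of the restore step, assuming a gzip-compressed plain-SQL dump and placeholder endpoint, user, and database names:
# decompress the dump downloaded from S3
gunzip dump.gz
# restore a plain-SQL dump against the RDS endpoint (host, user, and database are placeholders)
psql -h mydb.abcdefgh1234.us-east-1.rds.amazonaws.com -U myuser -d mydb -f dump
# for a custom-format dump, pg_restore would be used instead:
# pg_restore -h mydb.abcdefgh1234.us-east-1.rds.amazonaws.com -U myuser -d mydb dump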
Hope that helps!