Delete S3 files from an AWS Data Pipeline

I would like to ask about a processing task I am trying to complete using a data pipeline in AWS, but I have not been able to get it to work.
Basically, I have 2 data nodes representing 2 MySQL databases, from which data is extracted periodically and placed in an S3 bucket. This copy activity works fine: every day it selects every row that was added during the previous day (today - 1 day).
However, that bucket containing the collected data as CSVs should become the input for an EMR activity, which will be processing those files and aggregating the information. The problem is that I do not know how to remove or move the already processed files to a different bucket so I do not have to process all the files every day.
To clarify, I am looking for a way to move or remove already processed files in an S3 bucket from a pipeline. Can I do that? Alternatively, is there a way to process only some of the files in an EMR activity, based on a naming convention or something else?

Even better, create a DataPipeline ShellCommandActivity and use the aws command line tools.
Create a script with these two lines:
sudo yum -y upgrade aws-cli
aws s3 rm "$1" --recursive
The first line ensures you have the latest aws tools.
The second one removes a directory and all its contents. The $1 is an argument passed to the script.
In your ShellCommandActivity:
"scriptUri": "s3://myBucket/scripts/theScriptAbove.sh",
"scriptArgument": "s3://myBucket/myDirectoryToBeDeleted"
The details on how the aws s3 command works are at:
http://docs.aws.amazon.com/cli/latest/reference/s3/index.html
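If you would rather keep the processed files than delete them, the same ShellCommandActivity approach can move them to another location instead. The following is a minimal sketch, not a tested pipeline component: the function name and the two S3 locations are hypothetical, and it assumes the source and archive URIs are passed to the script as two arguments. It relies on `aws s3 mv --recursive`, which moves every object under the source prefix.

```shell
#!/usr/bin/env bash
# Sketch: move already-processed objects to an archive location so the
# next EMR run only sees new files. Both S3 URIs are hypothetical examples.

# archive_processed SRC_URI DEST_URI
# Moves every object under SRC_URI to DEST_URI.
archive_processed() {
  local src="$1" dest="$2"
  if [ -z "$src" ] || [ -z "$dest" ]; then
    echo "usage: archive_processed SRC_URI DEST_URI" >&2
    return 2
  fi
  aws s3 mv "$src" "$dest" --recursive
}

# When run as a script with exactly two arguments (for example, supplied
# as script arguments by the ShellCommandActivity), perform the move.
if [ "$#" -eq 2 ]; then
  archive_processed "$1" "$2"
fi
```

Moving rather than deleting keeps the raw CSVs available for reprocessing, at the cost of extra storage.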

1) Create a script that takes an input path and deletes the files under it using hadoop fs -rm -r s3path (the older -rmr form is deprecated).
2) Upload the script to S3.
In EMR, use a pre-step that does the following:
1) hadoop fs -copyToLocal s3://scriptname .
2) chmod +x scriptname
3) Run the script.
That's pretty much it.

Another approach that does not use EMR is to install the s3cmd tool through a ShellCommandActivity on a small EC2 instance; you can then use s3cmd in the pipeline to operate on your S3 bucket in whatever way you want.
A tricky part of this approach is configuring s3cmd safely through a configuration file (basically, passing the access key and secret), as you can't just ssh into the EC2 instance and run 's3cmd --configure' interactively in a pipeline.
To do that, you create the config file in the ShellCommandActivity using 'cat' with a here-document. For example:
cat <<EOT >> s3.cfg
blah
blah
blah
EOT
Then use the '-c' option to point s3cmd at the config file every time you call it, like this:
s3cmd -c s3.cfg ls
Sounds complicated, but works.

Related

How can I dump a database directly to an s3 bucket? [duplicate]

I have a server at Rackspace that already runs a cron job every night to do some processing (an account-related operation that sends me an email every midnight). My application is written in Groovy on Grails. Now I want to take a backup of a MySQL database (called myfleet) every midnight and put that file in Amazon S3. How can I do that? Do I need to write a Java or Groovy program for it, or can it be done from the Linux box itself? I already have an Amazon S3 account (the bucket name is fleetBucket).
You can also use STDOUT and the AWS CLI tool to pipe the output of your mysqldump straight to S3:
mysqldump -h [db_hostname] -u [db_user] -p[db_passwd] [databasename] | aws s3 cp - s3://[s3_bucketname]/[mysqldump_filename]
For example:
mysqldump -h localhost -u db_user -ppassword test-database | aws s3 cp - s3://database-mysqldump-bucket/test-database-dump.sql
The mysqldump command outputs to STDOUT by default. Using - as the input argument for aws s3 cp tells the AWS CLI tool to use STDIN for the input.
mysqldump --host=$HOST --user=$USER --password=$PASSWORD $DB_NAME --routines --single-transaction | gzip -9 | aws s3 cp - s3://bucket/database/filename.sql.gz
This stores the gzip-compressed dump directly in S3.
Should be pretty straightforward:
- backup your database using mysqldump
mysqldump -u [uname] -p[pass] myfleet | gzip -9 > myfleet.sql.gz
- upload your dump file to S3 using a command line client (e.g. s3cmd, http://s3tools.org/s3cmd):
s3cmd put myfleet.sql.gz s3://<bucketname>/myfleet.sql.gz
Just add this to your cron job (you might want to use some kind of numbering scheme for the dump files, in case you want to keep several versions).
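The numbering scheme mentioned above could be as simple as a date stamp in the object name. Here is a rough sketch that combines it with the streaming approach. The function names are made up for illustration, and it assumes the MySQL credentials are stored in ~/.my.cnf so that no password appears on the command line.

```shell
#!/usr/bin/env bash
# Sketch: nightly mysqldump streamed to S3 under a date-stamped key.
# Assumes MySQL credentials live in ~/.my.cnf (an assumption, not from the question).

# dump_key DB_NAME DATE -> prints e.g. myfleet-2024-01-31.sql.gz
dump_key() {
  printf '%s-%s.sql.gz\n' "$1" "$2"
}

# backup_db DB_NAME BUCKET
# The body runs in a subshell so that pipefail does not leak into the caller.
backup_db() (
  set -o pipefail
  db="$1"; bucket="$2"
  key="$(dump_key "$db" "$(date +%F)")"
  # Dump, compress and upload in one pipeline; nothing is written to local disk.
  mysqldump --single-transaction "$db" | gzip -9 | aws s3 cp - "s3://$bucket/$key"
)
```

Calling backup_db myfleet fleetBucket from the existing cron job would then produce one object per night, and old dumps can be pruned separately.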
If the source DB is on AWS and is an Aurora MySQL database, you can save data directly to S3 with a statement like
SELECT * FROM employees INTO OUTFILE S3 's3-us-west-2://aurora-select-into-s3-pdx/sample_employee_data';
See https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Integrating.SaveIntoS3.html for details.

Delete files older than 30 days under S3 bucket recursively without deleting folders using PowerShell

I can delete files and exclude folders with the following script:
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*"
When I tried to add a pipe to delete only the older files, it did not work. Please help with the script.
aws s3 rm s3://my-bucket/ --recursive --exclude="*" --include="*/*.*" | Where-Object {($_.LastModified -lt (Get-Date).AddDays(-31))}
The approach should be to list the files you need, then pipe the results to a delete call (a reverse of what you have): by the time the text output of aws s3 rm reaches Where-Object, the objects have already been deleted. This might be better managed by a full-blown script rather than a one-line shell command. There's an article on this and some examples here.
Going forward, consider letting S3 itself take care of this with a lifecycle expiration rule (optionally combined with versioning), so you don't have to manage a script or remember to run it. Note: versioning only keeps versions of files that are added after it has been enabled.
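The list-then-delete idea can be sketched as follows. This is a bash sketch rather than PowerShell, and only an illustration: the function names are hypothetical, it assumes the usual `aws s3 ls --recursive` output layout (date, time, size, key), and the cutoff calculation uses GNU date syntax. Unlike the include pattern in the question, it considers every key in the bucket, skipping only folder placeholders.

```shell
#!/usr/bin/env bash
# Sketch: list first, filter by date, then delete (the reverse of piping rm).

# keys_older_than CUTOFF
# Reads `aws s3 ls --recursive` output on stdin (date, time, size, key) and
# prints the keys last modified before CUTOFF (YYYY-MM-DD).
# Keys ending in "/" (folder placeholders) are skipped.
keys_older_than() {
  awk -v cutoff="$1" '
    ($1 "") < (cutoff "") {
      key = $0
      sub(/^[^ ]+ +[^ ]+ +[0-9]+ /, "", key)   # strip date, time and size
      if (key !~ /\/$/) print key
    }'
}

# delete_older_than BUCKET DAYS
# `date -d` is GNU date syntax; adjust it on other platforms.
delete_older_than() {
  local bucket="$1" days="$2" cutoff key
  cutoff="$(date -d "$days days ago" +%F)" || return 1
  aws s3 ls "s3://$bucket/" --recursive |
    keys_older_than "$cutoff" |
    while IFS= read -r key; do
      aws s3 rm "s3://$bucket/$key" < /dev/null
    done
}
```

Running only the listing and filtering stage first makes it possible to review the keys before anything is deleted.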

How do I download files with AWS CLI based on a list?

I'm trying to download a subset of files from a public s3 bucket that contains millions of IRS files. I can download the entire repository with the command:
aws s3 sync s3://irs-form-990/ ./
But it takes way too long!
I know I should be using the --include / --exclude flags, but I don't know how to use them with a list of values. I have a CSV that contains unique identifiers for all the files from 2017 that I'd like, but how do I use it with the AWS CLI? The list itself is half a million IDs long.
Help much appreciated. Thank you.
Here is a bash script that reads all the file names from a file, filename.txt.
All you have to do is convert those IDs into file names.
#!/bin/bash
set -e
# Read one file name per line; -r keeps backslashes literal.
while IFS= read -r line
do
  aws s3 cp "s3://bucket-name/$line" dest-path/
done < filename.txt
This question has been asked before; you can find the answer here.
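The ID-to-filename conversion step could look roughly like this. It is only a sketch: the "_public.xml" suffix is an assumption about how the bucket names its objects, ids.csv is a hypothetical file with the ID in its first column and no header row, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: turn a list of IDs into object keys and download only those objects.

# id_to_key ID -> object key. The "_public.xml" suffix is an assumption about
# how the bucket names its files; change it to match the real naming scheme.
id_to_key() {
  printf '%s_public.xml\n' "$1"
}

# download_ids BUCKET DEST_DIR
# Reads one ID per line on stdin; blank lines are skipped and failed
# downloads are reported on stderr without stopping the loop.
download_ids() {
  local bucket="$1" dest="$2" id
  while IFS= read -r id; do
    [ -n "$id" ] || continue
    aws s3 cp "s3://$bucket/$(id_to_key "$id")" "$dest/" < /dev/null ||
      echo "failed: $id" >&2
  done
}
```

It could then be driven with something like cut -d, -f1 ids.csv | download_ids irs-form-990 ./data. With half a million IDs, one cp call per object will be slow, so splitting the list and running several loops in parallel may be worth considering.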

Google cloud vm stop or freeze when extracting a .7z file

Recently I have been working with Google Cloud Compute Engine to train an ML model,
so I am trying to extract a .7z file that contains the data.
But the file is too big, and the machine freezes or stops with an uncaught error.
I am using the Linux command below:
!7zr 'path of the file'
Any help with extracting the file would be appreciated. Thanks in advance.
You could try it by using GCS.
Create a directory that only has the compressed file in it and nothing else:
yourdir/myfile.7z
Create an environment variable MYFILE=myfile.7z
Create a bucket on GCS using the gsutil CLI (gsutil mb creates the bucket itself; the MY_DIR_FOR_ZIP_FILE prefix appears once you upload to it):
gsutil mb gs://MYBUCKET
Next, upload the file to the bucket, like so:
gsutil -m cp -v $MYFILE gs://MYBUCKET/MY_DIR_FOR_ZIP_FILE/
Within the VM you can now download the file, again using the gsutil CLI:
gsutil -m cp -v gs://MYBUCKET/MY_DIR_FOR_ZIP_FILE/$MYFILE /YOUR_DIR
Then extract the archive and remove the compressed file:
7z x $MYFILE && rm -v $MYFILE
You should now have the uncompressed file on the VM.
Make sure to use the -m flag, which performs a parallel (multi-threaded/multi-processing) copy. Note that -m is a top-level gsutil option, so it goes before cp.
Here is the reference cp - Copy files and objects
Using the gsutil tool
The instructions above assume that the size of your data is less than 1 TB and that you are using a VM with a disk large enough to accommodate the data.
If your data is more than 1TB, you will need to use Transfer service for on-premises data.
The steps to follow when setting up transfer jobs are listed here
Creating a transfer job

AWS CLI - is there a way to extract tar.gz from S3 to home without storing the tar.gz?

To elaborate,
There is a tar.gz file on my AWS S3, let's call it example.tar.gz.
So, what I want to do is download the extracted contents of example.tar.gz to /var/home/.
One way to do it is to simply download the tar.gz, extract it, then delete the tar.gz.
However, I don't want to use space downloading the tar.gz file, I just want to download the extracted version or only store the extracted version.
Is this possible?
Thanks!
What you need is the following:
aws s3 cp s3://example-bucket/file.tar.gz - | tar -xz
This will stream the file.tar.gz from s3 and extract it directly (in-memory) to the current directory. No temporary files, no extra storage and no clean up after this one command.
Make sure you write the command exactly as above.
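To land the extracted contents in a specific directory such as /var/home/ rather than the current one, the same pipeline can be given a target with tar's -C option. A small sketch, with a hypothetical function name:

```shell
#!/usr/bin/env bash
# Sketch: stream an archive from S3 and unpack it into a chosen directory
# without ever writing the .tar.gz to disk.

# extract_from_s3 S3_URI DEST_DIR
extract_from_s3() {
  local uri="$1" dest="$2"
  mkdir -p "$dest" || return 1
  # "-" makes aws s3 cp write the object to stdout; tar reads it from stdin
  # (-f -) and unpacks into DEST_DIR (-C).
  aws s3 cp "$uri" - | tar -xzf - -C "$dest"
}
```

For the case in the question this would be called as extract_from_s3 s3://example-bucket/example.tar.gz /var/home/.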
Today I tested with Python Boto 3 and the AWS CLI, and I noticed that the tar.gz is extracted automatically when the file is downloaded.
There isn't currently a way you can do this with S3.
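You could create the following script, though, and run it whenever you wish to download the tar, as long as you have the IAM role / access keys set up.
#!/bin/bash
# Usage: ./myScript BUCKET_NAME FILE_LOCATION OUTPUT_FILE
aws s3 cp "s3://$1/$2" "$3"   # download the archive
tar -xvf "$3"                 # extract it
rm "$3"                       # delete the downloaded archive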
Then just call the script using ./myScript BUCKET_NAME FILE_LOCATION OUTPUT_FILE