How to clean up S3 files that are used by AWS Firehose after loading the files? - amazon-web-services

AWS Firehose uses S3 as intermediate storage before the data is copied to Redshift. Once the data has been transferred to Redshift, how do I clean up those files automatically when the load succeeds?
I deleted the files manually, and the delivery stream went into a bad state, complaining that the files had been deleted; I had to delete and recreate the Firehose delivery stream to resume.
Will deleting those files after 7 days with an S3 lifecycle rule work? Or is there an automated way for Firehose to delete the files once they have been successfully loaded into Redshift?

After discussing this with AWS Support, they confirmed it is safe to delete those intermediate files after the 24-hour period (or after the maximum retry duration).
A lifecycle rule with automatic deletion on the S3 bucket should fix the issue.
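For example, a minimal sketch of such a rule using boto3 (the bucket name and the firehose/ prefix are placeholders, not taken from the question):
import boto3

s3 = boto3.client("s3")
# Expire the intermediate Firehose objects automatically after 7 days
# (placeholder bucket name and prefix -- adjust to your delivery stream's settings)
s3.put_bucket_lifecycle_configuration(
    Bucket="my-firehose-staging-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-firehose-staging",
                "Filter": {"Prefix": "firehose/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)
The same rule can of course be created from the S3 console or the CLI instead.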
Hope it helps.

Once you're done loading your destination table, execute something similar to the following (the snippet below is typical of a shell script):
# If the staged file is still on S3, remove it (assumes $aws_bucket includes the s3:// prefix)
if aws s3 ls "$aws_bucket/$table_name.txt.gz"
then
    aws s3 rm "$aws_bucket/$table_name.txt.gz"
fi
This will check whether the file for the table you've just loaded still exists on S3 and, if it does, delete it. Execute it as part of a cron job.
If your ETL/ELT is not recursive, you can put this snippet towards the end of the script. It will delete the file on S3 after populating your table. Before executing this part, make sure that your target table has actually been populated.
If your ETL/ELT is recursive, you may put this somewhere at the beginning of the script to check for and remove the files created in the previous run. This retains the files until the next run and should be preferred, as the file acts as a backup in case the last load fails (or in case you need a flat file of the last load for any other purpose).

Related

Copy ~200,000 S3 files to new prefixes

I have ~200,000 S3 files that I need to partition, and have written an Athena query that produces a target S3 key for each of the original S3 keys. I can clearly turn this into a script, but how do I make the process robust and reliable?
I need to partition CSV files using info inside each CSV so that each file is moved to a new prefix in the same bucket. The files are mapped 1-to-1, but the new prefix depends on the data inside the file.
The copy command for each would be something like:
aws s3 cp s3://bucket/top_prefix/file.csv s3://bucket/top_prefix/var1=X/var2=Y/file.csv
I can build a single big script to copy everything using Athena and a bit of SQL, but I am concerned about doing this reliably, so that I can be sure everything is copied across and the script doesn't fail, time out, etc. Should I "just run the script"? From my machine, or is it better to run it on an EC2 instance first? Those kinds of questions.
This is a one-off, as the application code producing the files in S3 will start writing directly to the partitioned prefixes.
If each file contains data for only one partition, then you can simply move the files as you have shown. This is quite efficient because the content of the files does not need to be processed.
If, however, lines within the files each belong to different partitions, then you can use Amazon Athena to 'select' lines from an input table and output the lines to a destination table that resides in a different path, with partitioning configured. However, Athena does not "move" the files -- it simply reads them and then stores the output. If you were to do this for new data each time, you would need to use an INSERT statement to copy the new data into an existing output table, then delete the input files from S3.
Since it is a one-off, and each file belongs in only one partition, I would recommend you simply "run the script". It will go slightly faster from an EC2 instance, but the data is not uploaded/downloaded -- it all stays within S3.
I often create an Excel spreadsheet with a list of input locations and output locations. I create a formula to build the aws s3 cp <input> <output_path> commands, copy them to a text file and execute it as a batch. Works fine!
You mention that the destination depends on the data inside the object, so it would probably work well as a Python script that would loop through each object, 'peek' inside the object to see where it belongs, then issue a copy_object() command to send it to the right destination. (smart-open · PyPI is a great library for reading from an S3 object without having to download it first.)
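A rough sketch of that approach (the bucket name, prefix, and CSV column positions below are assumptions for illustration, not taken from the question):
import boto3
from smart_open import open as s3_open  # reads S3 objects without downloading them first

s3 = boto3.client("s3")
bucket = "my-bucket"      # placeholder
prefix = "top_prefix/"    # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".csv"):
            continue
        # Peek at the header and first data row to work out where the file belongs
        with s3_open(f"s3://{bucket}/{key}") as f:
            f.readline()                                  # skip the header line
            first_row = f.readline().strip().split(",")
        var1, var2 = first_row[0], first_row[1]           # assumed column positions
        new_key = f"{prefix}var1={var1}/var2={var2}/{key.split('/')[-1]}"
        # Server-side copy: the object data never leaves S3
        s3.copy_object(Bucket=bucket, Key=new_key, CopySource={"Bucket": bucket, "Key": key})
Adding a delete_object() call after each successful copy would turn this into a move, and logging each processed key to a file makes the script safe to re-run if it is interrupted.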

Putting a TWS file dependencies on AWS S3 stored file

I have an ETL application which is supposed to migrate to AWS infrastructure. The scheduler used in my application is Tivoli Workload Scheduler (TWS), and we want to keep using it in the cloud as well, since it handles our file dependencies.
Now, when we move to AWS, the files to be watched will land in an S3 bucket. Can we put an OPEN dependency on files in S3? If yes, what would be the hostname (HOST#Filepath)?
If not, which services could serve this purpose? I have both time and file dependencies in my SCHEDULES.
E.g., a file might be uploaded to S3 at 1 AM. At 3 AM my schedule gets triggered and looks for the file in the S3 bucket. If it is present, execution starts; if not, the job should wait according to the other parameters in TWS.
Any help or advice would be nice to have.
If I understand this correctly, the job triggered at 3 AM will identify all files uploaded within, e.g., the last 24 hours.
You can list the S3 objects to find everything uploaded within a specific period of time.
A better solution would be to create an S3 upload event notification that sends a message to SQS, and have your code inspect the queue depth (number of messages) there and start processing the files one by one. An additional benefit is the assurance that all items are processed without having to worry about time overlaps.
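A sketch of that pattern (the queue URL is a placeholder, and it assumes the bucket's event notifications are already configured to send object-created events directly to SQS):
import json
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-uploads"  # placeholder

# How many upload notifications are waiting?
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateNumberOfMessages"],
)
print("queue depth:", attrs["Attributes"]["ApproximateNumberOfMessages"])

# Drain the queue, handling one uploaded file at a time
while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=5)
    messages = resp.get("Messages", [])
    if not messages:
        break
    msg = messages[0]
    for record in json.loads(msg["Body"]).get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # ... start the downstream job for s3://bucket/key here ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])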

Copy limited number of files from S3?

We are using an S3 bucket to store a growing number of small JSON files (~1KB each) that contain some build-related data. Part of our pipeline involves copying these files from S3 and putting them into memory to do some operations.
That copy operation is done via an S3 CLI command that looks something like this:
aws s3 cp s3://bucket-path ~/some/local/path/ --recursive --profile dev-profile
The problem is that the number of JSON files on S3 is getting pretty large, since more are created every day. It's nowhere near the capacity of the S3 bucket, since the files are so small, but in practical terms there's no need to copy ALL of these JSON files. Realistically the system would be fine copying just the most recent 100 or so. We do, however, want to keep the older ones around for other purposes.
So my question boils down to: is there a clean way to copy a specific number of files from S3 (maybe sorted by most recent)? Is there some kind of pruning policy we can set on an S3 bucket to delete files older than X days, or something similar?
The aws s3 sync command in the AWS CLI sounds perfect for your needs.
It will copy only files that are new or modified since the last sync. However, this means the destination will need to retain a copy of the 'old' files so that they are not copied again.
Alternatively, you could write a script (eg in Python) that lists the objects in S3 and then only copies objects added since the last time the copy was run.
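For example, a sketch of such a script that, as the question asks, copies only the ~100 most recent objects (the bucket name and prefix are placeholders):
import os
import boto3

s3 = boto3.client("s3")
bucket = "my-build-data-bucket"   # placeholder
prefix = "builds/"                # placeholder
dest = os.path.expanduser("~/some/local/path/")

# List everything under the prefix, then keep only the 100 most recently modified objects
objects = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    objects.extend(page.get("Contents", []))

newest = sorted(objects, key=lambda o: o["LastModified"], reverse=True)[:100]

os.makedirs(dest, exist_ok=True)
for obj in newest:
    s3.download_file(bucket, obj["Key"], os.path.join(dest, os.path.basename(obj["Key"])))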
You can set lifecycle policies on the S3 bucket, which will remove objects after a certain period of time.
To copy only objects that are a certain number of days old, you will need to write a script.

AWS CLI S3 rm command does not produce error if file does not exist

I have a bash script which iterates over an array of file names and removes them from S3.
The following command:
aws s3 rm "s3://myBucket/myFolder/myFile.txt"
will produce this output.
delete: s3://myBucket/myFolder/myFile.txt
I can see that the delete was successful by verifying it has been removed in the AWS console.
However if I iterate over the same list again, I get the same output even though the file is gone.
Is there any way -- using just the rm command -- of indicating that AWS CLI tried to delete the file but could not find it?
The s3 CLI rm command uses the S3 DeleteObject API operation.
As you can see in the documentation, on a versioned bucket this adds a "delete marker" for the object; in a sense, the key is "labelled" as deleted.
There is no check, before the operation reports success, that the underlying object actually exists, so the call succeeds either way.
As S3 storage is distributed, consistency isn't guaranteed under all circumstances.
What this means is that if you carry out some operations on a file and then check it, the answer isn't certain.
In the case of S3 the AWS docs say
Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all regions.
"eventual consistency" means that at some undefined point in the future all the distributed nodes will catch up with the changes and the results returned from a query will be as expected, given the changes you have done
So basically, this is a long-winded way of saying: No, you can't get a confirmation that the file is deleted. Checking to see if it exists afterwards will not work reliably
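If you are able to wrap the delete in a small script rather than relying on rm alone, a sketch like the following (bucket and key taken from the example above) at least tells you whether the key existed at the moment you tried to delete it:
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def delete_if_exists(bucket, key):
    # Return True if the object was found and deleted, False if it was not found
    try:
        s3.head_object(Bucket=bucket, Key=key)
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise
    s3.delete_object(Bucket=bucket, Key=key)
    return True

if not delete_if_exists("myBucket", "myFolder/myFile.txt"):
    print("delete skipped: object not found")
Note that head_object() followed by delete_object() is not atomic, so this is a best-effort check rather than a guarantee.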

aws s3 mv/sync command

I have about 2 million files nested in subfolders in a bucket and want to move all of them to another bucket. After spending a lot of time searching, I found a solution using the AWS CLI mv/sync commands: either use the move command, or use the sync command and then delete all the files after they have successfully synced.
aws s3 mv s3://mybucket/ s3://mybucket2/ --recursive
or it can be:
aws s3 sync s3://mybucket/ s3://mybucket2/
But the problem is: how would I know how many files/folders have been moved or synced, and how much time it would take?
And what if some exception occurs (the machine/server stops, or the internet disconnects for any reason)? Do I have to execute the command again, or will it be sure to complete and move/sync all files? How can I be sure about the number of files moved/synced and files not moved/synced?
Or can I do something like this:
I move a limited number of files, e.g. 100 thousand, and repeat until all files are moved...
or move files on the basis of upload time, e.g. files uploaded from a starting date to an ending date.
If yes, how?
To sync them use:
aws s3 sync s3://mybucket/ s3://mybucket2/
You can repeat the command after it finishes (or fails) without issue. It will check whether anything is missing or different in the target S3 bucket and will process it again.
The time depends on the size of the files and how many objects you have. Amazon counts directories as objects, so they matter too.
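If you want a rough way to check progress, one option (a sketch using the bucket names from the commands above) is to count the objects on both sides and compare:
import boto3

s3 = boto3.client("s3")

def count_objects(bucket):
    # Page through the bucket listing and add up the number of keys on each page
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        total += page.get("KeyCount", 0)
    return total

print("source:     ", count_objects("mybucket"))
print("destination:", count_objects("mybucket2"))
Once the two counts match (and a final sync run reports nothing left to copy), it should be safe to delete the source objects.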