Anyone know how to get rid of all the temporary files that get created in the S3 buckets when using Athena to query?
Is there some setting or option to disable these, or some criteria I could use to filter and remove them?
I'm using a JDBC connection from Linux to select from my S3 bucket.
Amazon Athena writes the output of every query to files in Amazon S3. This is beneficial because the output can then be used in a subsequent process. It can also avoid the need to re-run queries, which is useful because Athena charges based on the data scanned by each query.
If you do not wish to keep these output files, or if you wish to remove them after a period of time, the easiest method is to configure Object Lifecycle Management on the Amazon S3 bucket. Simply create an expiration policy that deletes the files after a certain number of days. The files will then be deleted each night (or thereabouts).
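If you prefer to set this up programmatically rather than in the console, the same kind of expiration rule can be created with boto3. A minimal sketch, with the bucket name and the "athena-results/" prefix as placeholders:

import boto3

s3 = boto3.client("s3")

# Minimal sketch: expire Athena query results after 7 days.
# The bucket name and prefix below are placeholders; use your own values.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-athena-output-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-results",
                "Filter": {"Prefix": "athena-results/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)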
Related
I have multiple files present in different buckets in S3. I need to move these files to Amazon Aurora PostgreSQL every day on a schedule. Every day I will get a new file and, based on the data, insert or update will happen. I was using Glue for insert but with upsert Glue doesn't seem to be the right option. Is there a better way to handle this? I saw Load command from S3 to RDS will solve the issue but didn't get enough details on it. Any recommendations please?
You can trigger a Lambda function from S3 events, that could then process the file(s) and inject them into Aurora. Alternatively you can create a cron-type function that will run daily on whatever schedule you define.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
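For illustration, a minimal sketch of such an S3-triggered Lambda in Python. It assumes psycopg2 is available to the function (e.g. via a Lambda layer), that the incoming files are JSON lines, and a hypothetical my_table(id, payload) target table with connection details supplied as environment variables:

import json
import os
import urllib.parse

import boto3
import psycopg2  # assumption: packaged with the function or provided via a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    # S3 event notifications carry the bucket and key of the new object.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = [json.loads(line) for line in body.splitlines() if line]  # assumption: JSON lines

    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        for row in rows:
            # Upsert: insert, or update on primary-key conflict (table and columns are hypothetical).
            cur.execute(
                """
                INSERT INTO my_table (id, payload)
                VALUES (%s, %s)
                ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload
                """,
                (row["id"], json.dumps(row)),
            )
    conn.close()
    return {"processed": len(rows)}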
The Amazon Redshift documentation states that the best way to load data into the database is by using the COPY function. How can I run it automatically every day with a data file uploaded to S3?
The longer version: I have launched a Redshift cluster and set up the database. I have created an S3 bucket and uploaded a CSV file. Now from the Redshift Query editor, I can easily run the COPY function manually. How do I automate this?
Before you finalize your approach, consider these important points:
If possible, compress the CSV files with gzip before ingesting them into the corresponding Redshift tables. This reduces the file size considerably and improves overall data ingestion performance.
Finalize the compression encodings for the table columns. If you want Redshift to do this for you, automatic compression can be enabled with "COMPUPDATE ON" in the COPY command; refer to the AWS documentation.
Now, to answer your question:
Since you have already created an S3 bucket for this, create a directory (prefix) per table and place each table's files there. If your input files are large, split them into multiple files (ideally a multiple of the number of slices in your cluster, to enable better parallel ingestion; see the AWS docs for details).
Your COPY command should look something like this:
PGPASSWORD=<password> psql -h <host> -d <dbname> -p 5439 -U <username> -c "
  copy <table_name> from 's3://<bucket>/<table_dir_path>/'
  credentials 'aws_iam_role=<iam role identifier to ingest s3 files into redshift>'
  delimiter ',' region '<region>'
  GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1"
The next step is to create a Lambda function and enable SNS notifications on the S3 bucket, so that SNS triggers the Lambda as soon as new files arrive in the bucket. An alternative is a CloudWatch scheduled rule that runs the Lambda.
The Lambda (in Java, Python, or any supported language) reads the S3 files, connects to Redshift, and ingests the files into the tables using the COPY command.
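As a sketch of that Lambda in Python: one option (not the only one) is to use the Redshift Data API rather than a direct JDBC/psycopg2 connection; the cluster, database, user, table, bucket and IAM role below are placeholders:

import boto3

# Minimal sketch using the Redshift Data API (an alternative to opening a
# direct database connection from Lambda). All identifiers are placeholders.
redshift_data = boto3.client("redshift-data")

COPY_SQL = """
COPY my_table
FROM 's3://my-bucket/my_table/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
DELIMITER ',' GZIP IGNOREHEADER 1
"""

def handler(event, context):
    resp = redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="mydb",
        DbUser="awsuser",
        Sql=COPY_SQL,
    )
    # The call is asynchronous; the statement can be polled with describe_statement.
    return {"statement_id": resp["Id"]}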
Lambda has a 15-minute limit; if that is a concern for you, Fargate would be a better option. Running the job on EC2 will cost more than Lambda or Fargate (especially if you forget to turn the EC2 machine off).
You could create an external table over your bucket. Redshift would automatically scan all the files in the bucket. Bear in mind that query performance may not be as good as with data loaded via COPY, but what you gain is that no scheduler is needed.
Also, once you have an external table, you can load it into Redshift with a single CREATE TABLE AS SELECT ... FROM your_external_table. The benefit of that approach is that it's idempotent: you don't need to keep track of your files, because it always loads all data from all files in the bucket.
I have recently joined a company that uses S3 Buckets for various different projects within AWS. I want to identify and potentially delete S3 Objects that are not being accessed (read and write), in an effort to reduce the cost of S3 in my AWS account.
I read this, which helped me to some extent.
Is there a way to find out which objects are being accessed and which are not?
There is no native way of doing this at the moment, so all the options are workarounds depending on your use case.
You have a few options:
Tag each S3 object with a last-accessed date (e.g. 2018-10-24). First turn on object-level logging for your S3 bucket and set up CloudWatch Events for CloudTrail. The tag can then be updated by a Lambda function that runs on a CloudWatch Event fired by a Get request (see the sketch after this list). Then create a function that runs on a scheduled CloudWatch Event to delete all objects whose date tag is older than your chosen threshold.
Query CloudTrail logs: write a custom function that extracts last-access times from the object-level CloudTrail logs. This could be done with Athena, or by querying the logs in S3 directly.
Create a Separate Index, in something like DynamoDB, which you update in your application on read activities.
Use a Lifecycle Policy on the S3 Bucket / key prefix to archive or delete the objects after x days. This is based on upload time rather than last access time, so you could copy the object to itself to reset the timestamp and start the clock again.
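For the tagging option above, a minimal Lambda sketch. It assumes the function is invoked by a CloudWatch Events / EventBridge rule matching CloudTrail-recorded GetObject calls; verify the exact event shape in your own setup:

import datetime

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Assumption: the event is an "AWS API Call via CloudTrail" event for GetObject,
    # so the bucket and key sit under detail.requestParameters. Verify this shape.
    params = event["detail"]["requestParameters"]
    bucket = params["bucketName"]
    key = params["key"]

    # Note: put_object_tagging replaces the object's existing tag set.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={
            "TagSet": [
                {"Key": "last-accessed", "Value": datetime.date.today().isoformat()}
            ]
        },
    )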
No objects in Amazon S3 are required by other AWS services, but you might have configured services to use the files.
For example, you might be serving content through Amazon CloudFront, providing templates for AWS CloudFormation or transcoding videos that are stored in Amazon S3.
If you didn't create the files and you aren't knowingly using them, you can probably delete them. But you are the only person who would know whether they are necessary.
There is a recent AWS blog post that I found very interesting, describing a cost-optimized approach to this problem.
Here is the description from AWS blog:
The S3 server access logs capture S3 object requests. These are generated and stored in the target S3 bucket.
An S3 inventory report is generated for the source bucket daily. It is written to the S3 inventory target bucket.
An Amazon EventBridge rule is configured that will initiate an AWS Lambda function once a day, or as desired.
The Lambda function initiates an S3 Batch Operations job to tag objects in the source bucket that must be expired, using the following logic:
Capture the number of days (x) from the S3 Lifecycle configuration.
Run an Amazon Athena query that will get the list of objects from the S3 inventory report and server access logs. Create a delta list with objects that were created earlier than 'x' days, but not accessed during that time.
Write a manifest file with the list of these objects to an S3 bucket.
Create an S3 Batch operation job that will tag all objects in the manifest file with a tag of "delete=True".
The Lifecycle rule on the source S3 bucket will expire all objects that were created prior to 'x' days. They will have the tag given via the S3 batch operation of "delete=True".
Expiring Amazon S3 Objects Based on Last Accessed Date to Decrease Costs
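For the last step in that flow, the lifecycle rule that expires only the tagged objects can also be expressed in code; a minimal boto3 sketch, with the bucket name and day count as placeholders:

import boto3

s3 = boto3.client("s3")

# Minimal sketch of the final step above: a lifecycle rule that expires only
# objects carrying the delete=True tag. Bucket name and day count are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-source-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-tagged-objects",
                "Filter": {"Tag": {"Key": "delete", "Value": "True"}},
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # the 'x' days from the lifecycle configuration
            }
        ]
    },
)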
I have several JSON files in an S3 bucket. I need a monthly count of the PUT/GET requests each file receives.
Can these counts be exported as CSV, or accessed via an API? I have looked at CloudWatch and there doesn't appear to be an option for this, nor in the billing dashboard.
If this feature doesn't exist, are there any workarounds, such as a Lambda function with a counter?
Enable bucket logs under:
S3 > bucket > Properties > Server access logging > configure target bucket/prefix
Use Athena to query this data using simple SQL statements. Read more about Athena in the AWS documentation.
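For example, assuming you have created the Athena table over the access logs using the DDL from the AWS documentation (the database/table names and the output location below are assumptions), a query counting GETs and PUTs per key could be started like this:

import boto3

athena = boto3.client("athena")

# Assumed names: the database/table created from AWS's documented DDL for
# S3 server access logs, and an S3 output location for Athena results.
QUERY = """
SELECT key, operation, count(*) AS requests
FROM s3_access_logs_db.mybucket_logs
WHERE operation IN ('REST.GET.OBJECT', 'REST.PUT.OBJECT')
GROUP BY key, operation
ORDER BY requests DESC
"""

resp = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "s3_access_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query execution id:", resp["QueryExecutionId"])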
You can set up access logging for S3 buckets.
https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerLogs.html
Then you can export these logs and process them however you like, e.g. with a script that counts how many requests each file gets.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasksConsole.html
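As an alternative to a bash script, here is a minimal Python sketch that counts GET/PUT requests per key from access log files downloaded locally; the logs/ path and the naive field positions are assumptions based on the documented log format:

import collections
import glob

# Count GET/PUT requests per object key from locally downloaded S3 server access logs.
counts = collections.Counter()

for path in glob.glob("logs/*"):
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 9:
                continue
            # With a naive whitespace split of the documented log format,
            # field 7 is the operation (e.g. REST.GET.OBJECT) and field 8 is the key.
            operation, key = fields[7], fields[8]
            if operation in ("REST.GET.OBJECT", "REST.PUT.OBJECT"):
                counts[(key, operation)] += 1

for (key, operation), n in counts.most_common():
    print(f"{key}\t{operation}\t{n}")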
I would like to shut down my Redshift cluster but keep a backup of it.
I understand I can create a manual snapshot and it will be saved to S3.
To reduce costs even more, I would like to move the snapshot from S3 to Glacier, but I can't find the snapshot in my S3 account.
Where is the snapshot being saved? Is AWS keeping it in a different account?
Or maybe I'm not going about this the right way at all; should I be backing up my Redshift cluster differently?
Thank you,
Oren.
It's not stored in any of your account's S3 buckets. It's being stored "behind the scenes" in S3. Amazon only makes a point of telling you it is stored in S3 so you understand the fault-tolerance involved in the storage of your snapshots. If you need to store a backup in one of your S3 buckets you would need to do a pg_dump of the database and copy the dump file to S3.
You can use Redshift's UNLOAD to dump tables straight to an S3 bucket. Unfortunately you need to run it separately for each table. You'll also want to archive all the schema definitions, CREATE statements, etc. for your tables (the pg_dump solution doesn't have this problem since it captures the table definitions, but it requires local storage of the files and a manual push to S3... it might still be worth it for a case like archiving before a complete shutdown).
UNLOAD('select * from your_table') TO 's3://your_bucket/your_table.csv'
WITH CREDENTIALS 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET'
DELIMITER ',' NULL 'null' ALLOWOVERWRITE;
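Since UNLOAD has to be run per table, you can script the loop. A minimal Python sketch, assuming psycopg2 can reach the cluster (Redshift speaks the Postgres protocol on port 5439); the hostname, database, bucket and credentials are placeholders:

import psycopg2  # assumption: connecting to Redshift over the Postgres protocol

# Loop over the tables in a schema and UNLOAD each one to S3.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="awsuser",
    password="...",
)
conn.autocommit = True  # run each UNLOAD as its own statement
cur = conn.cursor()

cur.execute("SELECT tablename FROM pg_tables WHERE schemaname = 'public'")
tables = [row[0] for row in cur.fetchall()]

for table in tables:
    cur.execute(f"""
        UNLOAD ('select * from public.{table}')
        TO 's3://your_bucket/{table}.csv'
        WITH CREDENTIALS 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET'
        DELIMITER ',' NULL 'null' ALLOWOVERWRITE;
    """)

cur.close()
conn.close()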
Once you have all your tables in an S3 bucket, you can set the lifecycle rules (created in the Properties pane of the bucket, on the Lifecycle panel) to archive them to the Glacier storage class.
It's a bit confusing because Glacier is its own service, but when you archive via lifecycle the files stay in the S3 bucket. You can tell they're in Glacier by selecting a file in the S3 console, opening the Properties pane and then the Details panel; there it should say Storage class: Glacier.
If you ever need to restore you'll use the COPY command:
COPY your_table FROM 's3://your_bucket/your_table.csv'
CREDENTIALS 'aws_access_key_id=[YOURKEY];aws_secret_access_key=[YOURSECRET]'
DELIMITER ',' NULL 'null' IGNOREBLANKLINES EMPTYASNULL
BLANKSASNULL TIMEFORMAT 'auto' FILLRECORD MAXERROR 1