The Amazon Redshift documentation states that the best way to load data into the database is with the COPY command. How can I run it automatically every day against a data file uploaded to S3?
The longer version: I have launched a Redshift cluster and set up the database. I have created an S3 bucket and uploaded a CSV file. Now, from the Redshift query editor, I can easily run the COPY command manually. How do I automate this?
Before you finalize your approach, you should consider the following important points:
If possible, compress your CSV files with gzip before ingesting them into the corresponding Redshift tables. This reduces file size by a good margin and improves overall data-ingestion performance (a minimal upload sketch follows this list).
Finalize the compression encodings for your table columns. If you want Redshift to do this for you, automatic compression analysis can be enabled with COMPUPDATE ON in the COPY command. Refer to the AWS documentation for details.
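For example, a minimal sketch of the compress-and-upload step, assuming boto3 and entirely made-up bucket, file, and prefix names:

```python
# Minimal sketch: gzip a CSV locally and upload it to the table's S3 prefix.
# The bucket name, file name, and prefix are placeholders, not values from the question.
import gzip
import shutil

import boto3

def gzip_and_upload(local_csv, bucket, key):
    gz_path = local_csv + ".gz"
    with open(local_csv, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # compress the CSV
    boto3.client("s3").upload_file(gz_path, bucket, key)

gzip_and_upload("orders.csv", "my-ingest-bucket", "orders/orders.csv.gz")
```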
Now, to answer your question:
As you have already created an S3 bucket for this, create a directory (prefix) for each table and place that table's files there. If your input files are large, split them into multiple files; choose the number of files according to the number of nodes (slices) in your cluster to enable better parallel ingestion (refer to the AWS documentation for more details).
Your COPY command should look something like this:
PGPASSWORD=<password> psql -h <host> -d <dbname> -p 5439 -U <username> -c "copy <table_name> from 's3://<bucket>/<table_dir_path>/' credentials 'aws_iam_role=<iam role identifier to ingest s3 files into redshift>' delimiter ',' region '<region>' GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1"
The next step is to create a Lambda function and enable SNS notifications on the S3 bucket; the SNS topic should trigger the Lambda as soon as new files arrive in the bucket. An alternative is to use a CloudWatch Events schedule to run the Lambda.
The Lambda function (Java, Python, or any supported language) reads the S3 file location, connects to Redshift, and ingests the files into the tables using the COPY command.
Lambda has a 15-minute execution limit; if that is a concern for you, Fargate would be a better fit. Running the job on EC2 will generally cost more than Lambda or Fargate (for example, if you forget to stop the EC2 instance).
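A minimal sketch of such a Lambda handler in Python, assuming it is triggered directly by an S3 event, that psycopg2 is packaged with the function, and that the connection details and IAM role ARN come from environment variables (all of these names are placeholders, not values from the question):

```python
# Minimal sketch: Lambda handler that COPYs a newly arrived S3 file into Redshift.
# Connection details, role ARN, and the table/prefix naming convention are placeholders.
import os

import psycopg2

COPY_SQL = """
    COPY {table}
    FROM 's3://{bucket}/{prefix}/'
    IAM_ROLE '{iam_role}'
    DELIMITER ',' REGION '{region}'
    GZIP COMPUPDATE ON REMOVEQUOTES IGNOREHEADER 1
"""

def handler(event, context):
    # Assumes a direct S3 event trigger; with SNS in between, the S3 event
    # would be nested inside the SNS message body instead.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    table = key.split("/")[0]  # assumes one directory (prefix) per table

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(COPY_SQL.format(
                table=table,
                bucket=bucket,
                prefix=table,
                iam_role=os.environ["REDSHIFT_COPY_ROLE_ARN"],
                region=os.environ["AWS_REGION"],
            ))
        conn.commit()
    finally:
        conn.close()
```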
You could create an external table over your bucket; Redshift would then scan all the files in the bucket at query time. Bear in mind that query performance may not be as good as with data loaded via COPY, but what you gain is that no scheduler is needed.
Also, once you have an external table, you can load it into Redshift with a single CREATE TABLE AS SELECT ... FROM your_external_table. The benefit of that approach is that it is idempotent: you don't need to keep track of your files, because it always loads all data from all files in the bucket.
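A rough sketch of that external-table flow, run once from any SQL client; the schema, table, columns, role ARN, bucket, and connection settings below are invented for illustration:

```python
# Rough sketch: define a Spectrum external schema/table over the bucket,
# then materialize it once with CTAS. All names and the role ARN are illustrative.
import psycopg2

STATEMENTS = [
    """CREATE EXTERNAL SCHEMA IF NOT EXISTS landing
       FROM DATA CATALOG DATABASE 'landing_db'
       IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role'
       CREATE EXTERNAL DATABASE IF NOT EXISTS""",
    """CREATE EXTERNAL TABLE landing.orders_ext (
           order_id BIGINT, customer_id BIGINT, amount DECIMAL(10,2))
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       STORED AS TEXTFILE
       LOCATION 's3://my-ingest-bucket/orders/'""",
    # Idempotent load: always pulls all data from all files under the prefix.
    "DROP TABLE IF EXISTS orders",
    "CREATE TABLE orders AS SELECT * FROM landing.orders_ext",
]

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dev", user="admin", password="<password>")
conn.autocommit = True  # Redshift external DDL can't run inside a transaction block
cur = conn.cursor()
for stmt in STATEMENTS:
    cur.execute(stmt)
conn.close()
```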
Related
I am new to Amazon AWS. I have a use case to read ORC files from one S3 bucket, convert them to JSON files, and write them to another S3 bucket.
The volume is about 100 GB and roughly a thousand files every day.
I should be able to run this on demand or schedule it to run daily. What options should I consider?
Any ideas would be helpful.
Amazon Athena
You can use Amazon Athena to convert file formats via the CREATE TABLE AS command. See: Creating a Table from Query Results (CTAS) - Amazon Athena
The question then becomes how to send the commands to Athena. For this, you could schedule an AWS Lambda function to run, which starts an Amazon EC2 instance. Then, run a script on the instance to send all the commands to Amazon Athena. See: Auto-Stop EC2 instances when they finish a task - DEV Community
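However the commands are scheduled, the Athena call itself can be submitted with boto3's start_query_execution. A sketch, with made-up database, table, bucket, and output locations:

```python
# Sketch: submit an Athena CTAS that rewrites an ORC-backed table as JSON.
# The database, tables, buckets, and output location are illustrative only.
import boto3

athena = boto3.client("athena")

CTAS = """
CREATE TABLE converted_json
WITH (
    format = 'JSON',
    external_location = 's3://my-target-bucket/json/run-001/'
) AS
SELECT * FROM source_orc_table
"""

response = athena.start_query_execution(
    QueryString=CTAS,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```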
AWS Glue ETL job
Alternatively, you could create an AWS Glue ETL job that uses Spark to transform the data. See: Built-In Transforms - AWS Glue
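A rough sketch of such a Glue (PySpark) job script; the source and target bucket paths are placeholders:

```python
# Rough sketch of a Glue ETL job that reads ORC from one bucket and writes
# JSON to another. Bucket paths are placeholders, not values from the question.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the ORC files directly from the source bucket.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-source-bucket/orc/"]},
    format="orc",
)

# Write the same records out as JSON to the target bucket.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-target-bucket/json/"},
    format="json",
)

job.commit()
```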
I have multiple files present in different buckets in S3. I need to move these files to Amazon Aurora PostgreSQL every day on a schedule. Every day I will get a new file and, based on the data, an insert or update will happen. I was using Glue for inserts, but for upserts Glue doesn't seem to be the right option. Is there a better way to handle this? I saw that a load command from S3 to RDS would solve the issue, but I couldn't find enough details on it. Any recommendations, please?
You can trigger a Lambda function from S3 events, which can then process the file(s) and insert them into Aurora. Alternatively, you can create a cron-style scheduled function that runs daily, or on whatever schedule you define.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
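A minimal sketch of the S3-triggered path with a plain PostgreSQL upsert; the table, columns, and environment variables are invented for illustration, and a PostgreSQL driver such as psycopg2 would need to be packaged with the function:

```python
# Minimal sketch: Lambda triggered by an S3 event reads a CSV and upserts the
# rows into Aurora PostgreSQL. Table, columns, and env var names are illustrative.
import csv
import io
import os

import boto3
import psycopg2

UPSERT_SQL = """
    INSERT INTO customers (customer_id, name, email)
    VALUES (%s, %s, %s)
    ON CONFLICT (customer_id)
    DO UPDATE SET name = EXCLUDED.name, email = EXCLUDED.email
"""

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    rows = csv.reader(io.StringIO(body.decode("utf-8")))
    next(rows)  # skip the header line

    conn = psycopg2.connect(
        host=os.environ["AURORA_HOST"],
        dbname=os.environ["AURORA_DB"],
        user=os.environ["AURORA_USER"],
        password=os.environ["AURORA_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            for row in rows:
                cur.execute(UPSERT_SQL, row)
        conn.commit()
    finally:
        conn.close()
```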
I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer the data to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the best way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also tried the AWS documentation for Lambda and Glue, but I don't know where to find the exact solution to the problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (e.g. psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
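If bundling psycopg2 with the Lambda is inconvenient, one alternative (not part of the answer above, so treat it as an assumption) is the Redshift Data API, which lets a scheduled function submit the COPY without managing a database connection. A sketch with placeholder identifiers:

```python
# Sketch: issue the COPY through the Redshift Data API from a scheduled Lambda,
# avoiding a bundled Postgres driver. Cluster, database, user, table, bucket,
# and role ARN are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="loader",
        Sql=(
            "COPY my_table FROM 's3://my-bucket/my_table/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/my-copy-role' "
            "CSV IGNOREHEADER 1"
        ),
    )
```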
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully featured Node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog
Does anyone know how to get rid of all the temporary files that get created in S3 buckets when querying with Athena?
Is there some setting or option to disable these, or criteria to filter how to remove them?
I'm using a JDBC connection from Linux to select from my S3 bucket.
Amazon Athena creates files in Amazon S3 containing the output of every Athena query. This is beneficial because the output can then be used in a subsequent process. It can also avoid the need to re-run queries, which is useful because Athena charges based on the data scanned by each query.
If you do not wish to keep these output files, or if you wish to remove them after a period of time, the easiest method is to configure Object Lifecycle Management on the Amazon S3 bucket. Simply create an expiration policy that deletes the files after a certain number of days. The files will then be deleted each night (or thereabouts).
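For instance, a minimal sketch that applies such an expiration rule to an assumed results bucket and prefix with boto3 (the bucket name, prefix, and retention period are example values):

```python
# Sketch: expire Athena query results after 7 days via an S3 lifecycle rule.
# The bucket name, prefix, and retention period are example values.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-athena-results",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-athena-output",
                "Filter": {"Prefix": "athena-output/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)
```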
I have my data in a table in Redshift cluster. I want to periodically run a query against the Redshift table and store the results in a S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task that runs periodically and unloads the data from Redshift onto S3.
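A minimal Airflow sketch of that idea; the connection details, UNLOAD query, target path, role ARN, and schedule are all placeholders:

```python
# Minimal Airflow sketch: run UNLOAD against Redshift once a day, writing the
# results to S3. Connection details, query, paths, and role ARN are placeholders.
from datetime import datetime

import psycopg2
from airflow import DAG
from airflow.operators.python import PythonOperator

UNLOAD_SQL = """
    UNLOAD ('SELECT * FROM my_table')
    TO 's3://my-export-bucket/my_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
    FORMAT AS CSV ALLOWOVERWRITE
"""

def unload_to_s3():
    conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                            dbname="dev", user="admin", password="<password>")
    try:
        with conn.cursor() as cur:
            cur.execute(UNLOAD_SQL)
        conn.commit()
    finally:
        conn.close()

with DAG(
    dag_id="redshift_unload_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="unload_to_s3", python_callable=unload_to_s3)
```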
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
I believe you are looking for the AWS Data Pipeline service.
You can copy data from redshift to s3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL here for future reference:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident it would solve your use case.