Is there any support for running Athena queries on a schedule? We want to query some data daily, and dump a summarized CSV file, but it would be best if this happened on an automated schedule.
Schedule an AWS Lambda task to kick this off, or use a cron job on one of your servers.
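For example, here is a minimal sketch of a Lambda handler that starts the Athena query with boto3; a scheduled EventBridge (CloudWatch Events) rule can invoke it daily, and Athena writes the result CSV to the output location. The database, query, and bucket names are placeholders:

```python
import boto3

athena = boto3.client("athena")

def handler(event, context):
    # Start the daily summary query; Athena writes the results as a CSV
    # file to the OutputLocation below. All names are placeholders.
    response = athena.start_query_execution(
        QueryString="SELECT col, count(*) AS cnt FROM my_table GROUP BY col",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-results-bucket/daily-summary/"},
    )
    return response["QueryExecutionId"]
```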
We know that the usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. as CSV), then use a crawler and schedule it.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.
You may manually specify the table. The crawler only discovers the schema. If you set the schema manually, you should be able to read your data when you run the AWS Glue Job.
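As a sketch, assuming a CSV layout and placeholder database, table, column, and bucket names, the table can be registered once with boto3 instead of running a crawler:

```python
import boto3

glue = boto3.client("glue")

# Register the schema once; the crawler is only needed for discovery.
# Database, table, column, and bucket names below are placeholders.
glue.create_table(
    DatabaseName="my_database",
    TableInput={
        "Name": "my_table",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "bigint"},
                {"Name": "name", "Type": "string"},
            ],
            "Location": "s3://my-bucket/my-prefix/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```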
We had this same problem for one of our customers, who had millions of small files in Amazon S3. The crawler would practically stall, making no progress and running indefinitely. We came up with the following alternative approach (a rough sketch follows the steps):
1. A custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries at AWS Athena.
2. The Python Shell job lists the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. The queries fired: alter table add partition (event_date='<event_date from above>', eventname='<list derived from the above S3 List output>')
4. This was triggered to run after the main ingestion job via Glue Workflows.
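A rough sketch of the Python Shell job described above, using AWS Wrangler (database, table, bucket, and date values are placeholders):

```python
import awswrangler as wr

DATABASE = "my_database"   # placeholder
TABLE = "my_events"        # placeholder
EVENT_DATE = "2021-01-01"  # would come from the ingestion job / workflow
BASE_PATH = f"s3://my-bucket/events/event_date={EVENT_DATE}/"  # placeholder bucket

# List the eventname=... sub-folders for the given event_date.
prefixes = wr.s3.list_directories(BASE_PATH)

for prefix in prefixes:
    # Prefix looks like .../event_date=2021-01-01/eventname=click/
    eventname = prefix.rstrip("/").split("eventname=")[-1]
    sql = (
        f"ALTER TABLE {TABLE} ADD IF NOT EXISTS PARTITION "
        f"(event_date='{EVENT_DATE}', eventname='{eventname}')"
    )
    # Fire the DDL at Athena instead of re-crawling millions of files.
    wr.athena.start_query_execution(sql=sql, database=DATABASE, wait=True)
```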
If you are not expecting the schema to change, create the tables manually using a Glue Database and Table, and then use the Glue job directly.
We had a scheduled AWS Lambda job that went out and ran an Athena query daily and inserted that data into a 3rd party utility.
This utility is no longer in use and the Lambda job has been deleted, but a lot of the Athena queries that were run still show up in our Saved Queries list. Is there a way to batch delete these queries other than writing a utility to do it? Going into Athena and selecting each one for deletion is very time consuming and ends up not being done.
You could write a script that calls ListNamedQueries, then loops through them and calls DeleteNamedQuery.
This could be done with the AWS CLI or any AWS SDK.
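A minimal sketch with boto3, assuming you want to delete every saved query in the default workgroup (add a filter if you need to keep some):

```python
import boto3

athena = boto3.client("athena")

# Page through all named (saved) queries and delete them.
paginator = athena.get_paginator("list_named_queries")
for page in paginator.paginate():
    for query_id in page["NamedQueryIds"]:
        athena.delete_named_query(NamedQueryId=query_id)
```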
I'm trying to create the Architecture on AWS where a lambda function runs SQL Code to refresh a materialized view on AWS Redshift. I would like the materialized view to refresh after the daily ETL processes have completed on the Redshift cluster. Is there a way of setting up the lambda function to be triggered after a particular SQL command on the Redshift Cluster has completed?
Unfortunately, I've only seen examples of people scheduling the Lambda function to run at particular intervals or at a particular time. Any help would be much appreciated.
A couple of ways that this can be done (out of many):
1. Have the ETL process trigger the Lambda - this is straightforward if the ETL tool can generate the trigger, but organizational factors can make changing ETL frameworks difficult.
2. Use an S3 semaphore - have your ETL SQL UNLOAD some small data (like a text string of metadata) to S3, where the object's creation will trigger the Lambda. Insert the UNLOAD at the point in the ETL SQL where you want the update to occur (a sketch of the Lambda side follows below).
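For the S3 semaphore option, here is a sketch of the Lambda side, assuming the Redshift Data API is used to run the refresh (cluster, database, secret, and view names are placeholders):

```python
import boto3

redshift_data = boto3.client("redshift-data")

def handler(event, context):
    # Triggered by the s3:ObjectCreated event for the semaphore object
    # that the ETL's UNLOAD writes at the end of its run.
    redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",          # placeholder
        Database="analytics",                    # placeholder
        SecretArn="arn:aws:secretsmanager:...",  # placeholder credentials secret
        Sql="REFRESH MATERIALIZED VIEW my_materialized_view;",
    )
```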
I am trying to populate as many Glue job metrics as possible for some testing; below is the setup I have created:
A crawler reads data (dummy customer data of 500 rows) from a CSV file placed in an S3 bucket.
Another crawler crawls the tables created in the Redshift cluster.
An ETL job finally reads the data from the CSV file in S3 and dumps it into a Redshift table (a minimal sketch of such a job is below).
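For reference, a minimal sketch of such an ETL job script, assuming placeholder catalog names, a Glue connection to Redshift, and a temporary directory configured for the job:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled CSV table from the Data Catalog (placeholder names).
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="customers_csv"
)

# Write into Redshift through a Glue connection (placeholder names).
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.customers", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
)

job.commit()
```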
The job runs without any issue and I am able to see the final data getting dumped into the Redshift table; however, in the end, only the 5 CloudWatch metrics below are being populated:
glue.jvm.heap.usage
glue.jvm.heap.used
glue.s3.filesystem.read_bytes
glue.s3.filesystem.write_bytes
glue.system.cpuSystemLoad
There are approximately 20 more metrics which are not getting populated.
Any suggestions on how to populate those remaining metrics as well?
Double check that the CloudWatch metrics for your job are enabled (a sketch of how to set this is below).
Make sure your job runs longer, say > 3 minutes, so that CloudWatch has time to push the metrics.
For this you can add a sleep in your code.
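For the first point, here is a sketch of how the metrics flag can be set when the job is defined with boto3. All names, ARNs, and paths are placeholders, and --enable-metrics is a flag-style special parameter, so its value is conventionally left empty:

```python
import boto3

glue = boto3.client("glue")

# Job metrics are enabled with the special --enable-metrics parameter.
# All names, ARNs, and paths below are placeholders.
glue.create_job(
    Name="my-etl-job",
    Role="arn:aws:iam::123456789012:role/MyGlueRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_etl.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--enable-metrics": ""},
    GlueVersion="2.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)
```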
Assuming that you are using Glue version 2.0+ for the above job, please be advised that AWS Glue 2.0+ does not use dynamic allocation, hence the ExecutorAllocationManager metrics are not available. Switch back to Glue 1.0 and you should see that all the documented metrics become available.
I met the same issue. Do your glue.s3.filesystem.read_bytes and glue.s3.filesystem.write_bytes have any data?
One possible reason is that AWS Glue job metrics are not emitted if the job completes in less than 30 seconds.
When running the job, enable the metrics option under the Monitoring tab.
I have my data in a table in Redshift cluster. I want to periodically run a query against the Redshift table and store the results in a S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task which runs against Redshift periodically and unloads the data from Redshift onto S3.
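A rough sketch of such a DAG, assuming the Postgres provider is installed, a Redshift connection named redshift_default exists, and placeholder bucket, table, and IAM role values (on newer Airflow versions the scheduling argument is schedule rather than schedule_interval):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# UNLOAD runs inside Redshift; bucket, table, and IAM role are placeholders.
UNLOAD_SQL = """
UNLOAD ('SELECT * FROM public.my_table')
TO 's3://my-bucket/exports/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT AS CSV
ALLOWOVERWRITE;
"""

with DAG(
    dag_id="redshift_unload_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    unload_to_s3 = PostgresOperator(
        task_id="unload_to_s3",
        postgres_conn_id="redshift_default",  # Redshift speaks the Postgres protocol
        sql=UNLOAD_SQL,
    )
```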
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
I believe you are looking for the AWS Data Pipeline service.
You can copy data from Redshift to S3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident that it would solve your use case.