I have an AWS Data Pipeline with an EMR activity that writes data to S3. At the end of this process, it also writes some metadata to a specific S3 folder in that location.
Is there a way to trigger an AWS Glue crawler from within a Data Pipeline definition, so that it scans this last S3 location and creates an Athena table?
I haven't found a way to do this in the AWS Data Pipeline documentation.
Maybe you could use ShellCommandActivity and call aws glue start-crawler.
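As a rough illustration of that suggestion, here is a minimal sketch that builds a ShellCommandActivity object for a pipeline definition. The crawler name, activity id, and resource id are hypothetical placeholders, and it assumes the resource running the command has the AWS CLI installed and a role allowed to call glue:StartCrawler.

```python
import json
import shlex


def crawler_activity(crawler_name, depends_on_id, resource_id):
    """Build a ShellCommandActivity object that starts a Glue crawler.

    depends_on_id is the id of the activity that must finish first
    (for example the EMR activity); resource_id is the id of the
    resource the shell command runs on. Both ids are placeholders here.
    """
    return {
        "id": "StartGlueCrawler",
        "name": "StartGlueCrawler",
        "type": "ShellCommandActivity",
        # shlex.quote guards against spaces or shell metacharacters in the name
        "command": f"aws glue start-crawler --name {shlex.quote(crawler_name)}",
        "dependsOn": {"ref": depends_on_id},
        "runsOn": {"ref": resource_id},
    }


if __name__ == "__main__":
    # Hypothetical names; replace with the ids used in your own pipeline.
    activity = crawler_activity("metadata-crawler", "MyEmrActivity", "MyEc2Resource")
    print(json.dumps({"objects": [activity]}, indent=2))
```

The resulting object would be merged into the "objects" list of the existing pipeline definition, next to the EMR activity it depends on.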
We know that the usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. as CSV), then run a crawler over it on a schedule.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.
You may manually specify the table. The crawler only discovers the schema. If you set the schema manually, you should be able to read your data when you run the AWS Glue Job.
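One way to specify the table manually is through the Glue CreateTable API. The sketch below builds a TableInput for comma-delimited text files and hands it to a Glue client passed in by the caller. The table name, database, S3 location, and columns are hypothetical, and the SerDe settings assume plain CSV without quoted fields.

```python
def csv_table_input(name, location, columns):
    """Build a Glue TableInput for comma-delimited text files.

    columns is a list of (column_name, hive_type) pairs.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def register_table(glue_client, database, table_input):
    """Create the table in the Data Catalog using the supplied Glue client."""
    glue_client.create_table(DatabaseName=database, TableInput=table_input)


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   table = csv_table_input("sales", "s3://my-bucket/sales/", [("id", "bigint"), ("amount", "double")])
#   register_table(boto3.client("glue"), "my_database", table)
```

Passing the client in keeps the table definition easy to inspect or test without touching AWS.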
We had this same problem with one of our customers, who had millions of small files in AWS S3. The crawler would practically stall and keep running indefinitely. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written, which leveraged AWS Wrangler to fire queries at AWS Athena.
2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. The queries fired:
alter table add partition (event_date='<event_date from above>',eventname='List derived from above S3 List output')
4. This was triggered to run after the main ingestion job via Glue Workflows.
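The partition step above can be sketched roughly as follows. This is not the original job; it assumes Hive-style keys of the form .../event_date=<date>/eventname=<name>/<file>, and the table name is a hypothetical placeholder. The function only derives the statements; submitting them to Athena (for example through AWS Wrangler) is left to the caller.

```python
import re

# Assumed key layout: <prefix>/event_date=<date>/eventname=<name>/<file>
PARTITION_RE = re.compile(r"event_date=([^/]+)/eventname=([^/]+)/")


def partition_statements(table, keys):
    """Return one ALTER TABLE ... ADD PARTITION statement per distinct partition.

    keys is an iterable of S3 object keys, e.g. from a list-objects call.
    Keys that do not match the assumed layout are skipped.
    """
    seen = set()
    statements = []
    for key in keys:
        match = PARTITION_RE.search(key)
        if not match:
            continue
        part = match.groups()
        if part in seen:
            continue  # many small files share one partition; emit it once
        seen.add(part)
        event_date, eventname = part
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (event_date='{event_date}', eventname='{eventname}')"
        )
    return statements


if __name__ == "__main__":
    sample_keys = [
        "raw/event_date=2021-01-01/eventname=click/part-0001.json",
        "raw/event_date=2021-01-01/eventname=click/part-0002.json",
        "raw/event_date=2021-01-01/eventname=view/part-0001.json",
    ]
    for sql in partition_statements("events", sample_keys):
        print(sql)
```

If the folders are not laid out Hive-style under the table location, each statement would also need an explicit LOCATION clause.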
If you are not expecting the schema to change, create the tables manually in a Glue database and then use the Glue job directly.
I am new to Amazon AWS. I have a use case to read ORC files from one S3 bucket, convert them to JSON files, and write them to another S3 bucket.
The volume is about 100 GB and roughly a thousand files every day.
I should be able to run this on demand or schedule it to run daily. What options should I consider?
Any ideas would be helpful.
Amazon Athena
You can use Amazon Athena to convert file formats via the CREATE TABLE AS command. See: Creating a Table from Query Results (CTAS) - Amazon Athena
The question then becomes how to send the commands to Athena. For this, you could schedule an AWS Lambda function to run, which starts an Amazon EC2 instance. Then, run a script on the instance to send all the commands to Amazon Athena. See: Auto-Stop EC2 instances when they finish a task - DEV Community
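As an illustration of the Athena route, the sketch below builds a CTAS statement that rewrites an existing ORC-backed table as JSON and submits it through an Athena client passed in by the caller. All table names and S3 locations are hypothetical, and it assumes the ORC data is already registered as a table in the chosen database.

```python
def ctas_orc_to_json(source_table, target_table, output_location):
    """Build a CTAS statement that writes source_table out as JSON files.

    output_location is the S3 prefix for the converted files; Athena
    expects it to be empty when the query runs.
    """
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'JSON', external_location = '{output_location}') "
        f"AS SELECT * FROM {source_table}"
    )


def run_query(athena_client, sql, database, results_location):
    """Submit the query and return its execution id."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_location},
    )
    return response["QueryExecutionId"]


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   sql = ctas_orc_to_json("orc_events", "json_events", "s3://target-bucket/json/2021-01-01/")
#   run_query(boto3.client("athena"), sql, "my_database", "s3://my-athena-results/")
```

Because the target table and location must not already exist, a daily run would typically use a dated table name and prefix.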
AWS Glue ETL job
Alternatively, you could create an AWS Glue ETL job that uses Spark to transform the data. See: Built-In Transforms - AWS Glue
I'm looking to write an ETL solution that orchestrates copying data from S3 into an AWS EMR cluster running Apache HBase.
The steps I'm looking to write are as follows:
1. A CSV file is uploaded to an S3 bucket.
2. A Lambda function is triggered, moving the file from S3 into the HBase cluster's HDFS.
3. HBase's ImportTsv utility is invoked to bulk load the CSV on HDFS into an HBase table.
I'm new to the world of AWS, so I'm not sure what the best tools are to orchestrate this workflow. How would I go about implementing this?
I wish to transfer data from a database such as MySQL (RDS) to S3 using AWS Glue ETL.
I am having difficulty doing this, and the documentation is not much help.
I found this link here on Stack Overflow:
Could we use AWS Glue just copy a file from one S3 folder to another S3 folder?
Based on this link, it seems that Glue cannot use an S3 bucket as a data destination, only as a data source.
I hope I am wrong about this.
If one builds an ETL tool on AWS, one of the first basics is for it to transfer data to and from S3 buckets, the major form of storage on AWS.
I hope someone can help with this.
You can add a Glue connection to your RDS instance and then use the Spark ETL script to write the data to S3.
You'll have to first crawl the database table using Glue Crawler. This will create a table in the Data Catalog which can be used in the job to transfer the data to S3. If you do not wish to perform any transformation, you may directly use the UI steps for autogenerated ETL scripts.
I have also written a blog on how to Migrate Relational Databases to Amazon S3 using AWS Glue. Let me know if it addresses your query.
https://ujjwalbhardwaj.me/post/migrate-relational-databases-to-amazon-s3-using-aws-glue
Have you tried https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-copyrdstos3.html?
You can use AWS Data Pipeline; it has standard templates for full as well as incremental copies from RDS to S3.
I have my data in a table in a Redshift cluster. I want to periodically run a query against the Redshift table and store the results in an S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task that periodically runs the query against Redshift and unloads the data onto S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
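Whichever scheduler is used, the scheduled code mostly needs to build and run an UNLOAD statement. A minimal sketch of the statement builder, with a hypothetical query, bucket, and IAM role ARN, could look like this:

```python
def build_unload(query, s3_prefix, iam_role_arn):
    """Build a Redshift UNLOAD statement for the given SELECT query.

    Single quotes inside the query are doubled because UNLOAD takes the
    query as a quoted string.
    """
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}'"
    )


if __name__ == "__main__":
    # Hypothetical values; the role must be attached to the cluster and
    # allowed to write to the bucket.
    print(
        build_unload(
            "select * from sales where region = 'EU'",
            "s3://my-bucket/exports/sales_",
            "arn:aws:iam::123456789012:role/MyUnloadRole",
        )
    )
```

The Lambda function or cron job would then execute the returned string over its usual Redshift connection.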
I believe you are looking for the AWS Data Pipeline service.
You can copy data from Redshift to S3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident that it would solve your use case.