I am new to AWS. I have a use case where I need to read ORC files from one S3 bucket, convert them to JSON files, and write them to another S3 bucket.
The volume is about 100 GB, roughly a thousand files, every day.
I should be able to run this on demand or schedule it to run daily. What options should I consider?
Any ideas would be helpful.
Amazon Athena
You can use Amazon Athena to convert file formats via the CREATE TABLE AS command. See: Creating a Table from Query Results (CTAS) - Amazon Athena
The question then becomes how to send the commands to Athena. For this, you could schedule an AWS Lambda function to run, which starts an Amazon EC2 instance. Then, run a script on the instance to send all the commands to Amazon Athena. See: Auto-Stop EC2 instances when they finish a task - DEV Community
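As a rough illustration of the Athena route, the sketch below builds a CTAS statement that writes JSON and submits it with boto3. It assumes a table has already been defined over the ORC files and that the target location is empty; the table names, bucket paths, and function names are hypothetical. The same call could be made from a script on an EC2 instance or from anywhere else that has AWS credentials.

```python
def build_ctas(source_table: str, target_table: str, target_location: str) -> str:
    """Return a CTAS statement that rewrites source_table as JSON files."""
    return (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'JSON', external_location = '{target_location}') "
        f"AS SELECT * FROM {source_table}"
    )


def run_ctas(sql: str, database: str, results_location: str) -> str:
    """Submit the statement to Athena and return the query execution id."""
    import boto3  # imported lazily so build_ctas works without boto3 installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": results_location},
    )
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only:
# sql = build_ctas("orc_events", "json_events", "s3://my-target-bucket/json/")
# run_ctas(sql, "my_database", "s3://my-athena-results/")
```

The query runs asynchronously, so a real script would probably poll the execution status before treating the conversion as finished.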
AWS Glue ETL job
Alternatively, you could create an AWS Glue ETL job that uses Spark to transform the data. See: Built-In Transforms - AWS Glue
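A minimal sketch of what such a Spark job might look like, using plain PySpark rather than any Glue-specific transform. The `--source` and `--target` job arguments are hypothetical names chosen for this example, and the script assumes it runs where PySpark is available (as it is on Glue Spark workers).

```python
import argparse


def parse_job_args(argv):
    """Read --source and --target; ignore any extra arguments passed to the job."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True)
    parser.add_argument("--target", required=True)
    args, _unknown = parser.parse_known_args(argv)
    return args.source, args.target


def convert_orc_to_json(spark, source, target):
    """Read every ORC file under source and write the rows out as JSON lines."""
    spark.read.orc(source).write.mode("overwrite").json(target)


def main(argv):
    from pyspark.sql import SparkSession  # imported lazily; needs a Spark runtime

    source, target = parse_job_args(argv)
    spark = SparkSession.builder.appName("orc-to-json").getOrCreate()
    convert_orc_to_json(spark, source, target)
    spark.stop()


# In the job script itself you would end with something like:
# import sys; main(sys.argv[1:])
```

Note that `mode("overwrite")` replaces whatever is already under the target path, which may or may not be what a daily run should do.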
Related
We know that the usual way to get data written by a PySpark script (AWS Glue job) into the AWS Glue Data Catalog is to write it to an S3 bucket (e.g. as CSV), run a crawler over it, and schedule the crawler.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a more direct way to do this, e.g. writing a file to S3 and syncing it to the AWS Glue Data Catalog.
You may manually specify the table. The crawler only discovers the schema. If you set the schema manually, you should be able to read your data when you run the AWS Glue Job.
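For example, a table can be registered in the Data Catalog with boto3 instead of a crawler. This is a sketch under assumptions: the data is comma-delimited text, and the database, table, column names, and S3 location are all hypothetical.

```python
def csv_table_input(name, location, columns):
    """Build the TableInput structure for a comma-delimited table on S3.

    columns is a list of (column_name, hive_type) pairs.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def register_table(database, table_input):
    """Create the table in the Glue Data Catalog (the database must already exist)."""
    import boto3  # imported lazily so csv_table_input works without boto3 installed

    boto3.client("glue").create_table(DatabaseName=database, TableInput=table_input)


# Hypothetical usage:
# register_table("my_database", csv_table_input(
#     "orders", "s3://my-bucket/orders/", [("order_id", "bigint"), ("status", "string")]))
```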
We had this same problem with one of our customers, who had millions of small files in AWS S3. The crawler would effectively stall and keep running indefinitely. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written, which used AWS Wrangler to fire queries at Amazon Athena.
2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. The queries fired:
alter table add partition (event_date='<event_date from above>',eventname='List derived from above S3 List output')
4. This was triggered to run after the main ingestion job via Glue Workflows.
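The approach above could be sketched roughly as follows with AWS Wrangler (the `awswrangler` package). This is not the original job's code: the table name, database, and base path are hypothetical, and it assumes the folders under each event_date are Hive-style `eventname=<value>/` prefixes under the table's own S3 location.

```python
def partition_value(s3_dir, key):
    """Extract the value from a Hive-style folder such as .../eventname=login/."""
    folder = s3_dir.rstrip("/").rsplit("/", 1)[-1]
    name, _, value = folder.partition("=")
    if name != key or not value:
        raise ValueError(f"{s3_dir!r} is not a {key}=<value> folder")
    return value


def add_partitions_sql(table, event_date, event_dirs):
    """Build one ALTER TABLE statement that adds a partition per eventname folder."""
    parts = " ".join(
        f"PARTITION (event_date='{event_date}', "
        f"eventname='{partition_value(d, 'eventname')}')"
        for d in event_dirs
    )
    return f"ALTER TABLE {table} ADD IF NOT EXISTS {parts}"


def sync_partitions(table, database, base_path, event_date):
    """List the eventname folders for one date and register them in Athena."""
    import awswrangler as wr  # imported lazily; available in Glue Python Shell jobs

    dirs = wr.s3.list_directories(f"{base_path}/event_date={event_date}/")
    if not dirs:
        return None
    query_id = wr.athena.start_query_execution(
        sql=add_partitions_sql(table, event_date, dirs), database=database
    )
    wr.athena.wait_query(query_execution_id=query_id)
    return query_id
```

Listing one date prefix per run keeps the work proportional to the new data, which is the point of avoiding a full crawl.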
If you are not expecting the schema to change, create the tables manually in the Glue Data Catalog (database and table) and then use the Glue job directly.
I'm looking to write an ETL solution that orchestrates copying data from S3 into an AWS EMR cluster running Apache HBase.
The steps I'm looking to write are as follows:
1. A CSV file is uploaded to an S3 bucket.
2. A Lambda function is triggered, moving the file from S3 into the HBase cluster's HDFS.
3. HBase's ImportTsv utility is invoked to bulk load the CSV on HDFS into an HBase table.
I'm new to the world of AWS, so I'm not sure what the best tools are to orchestrate this workflow. How would I go about implementing this?
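One possible shape for this workflow, offered as a sketch rather than a recommendation: the Lambda function does not move the file itself but submits two steps to the running EMR cluster, one to copy the object into HDFS and one to run ImportTsv. The environment variable names, the HDFS directory (assumed to exist already), the table, and the column mapping are all hypothetical.

```python
from urllib.parse import unquote_plus


def build_steps(bucket, key, hdfs_dir, table, columns):
    """Two EMR steps: copy the object into HDFS, then bulk load it with ImportTsv."""
    file_name = key.rsplit("/", 1)[-1]
    hdfs_path = f"{hdfs_dir.rstrip('/')}/{file_name}"

    def step(name, args):
        return {
            "Name": name,
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
        }

    return [
        step("Copy CSV from S3 to HDFS",
             ["hadoop", "fs", "-cp", "-f", f"s3://{bucket}/{key}", hdfs_path]),
        step("ImportTsv into HBase",
             ["hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
              "-Dimporttsv.separator=,", f"-Dimporttsv.columns={columns}",
              table, hdfs_path]),
    ]


def lambda_handler(event, context):
    """Triggered by an S3 upload event; queues the steps on an existing cluster."""
    import os
    import boto3  # imported lazily so build_steps can be tested without boto3

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded
    steps = build_steps(bucket, key, os.environ["HDFS_DIR"],
                        os.environ["HBASE_TABLE"], os.environ["HBASE_COLUMNS"])
    response = boto3.client("emr").add_job_flow_steps(
        JobFlowId=os.environ["CLUSTER_ID"], Steps=steps
    )
    return response["StepIds"]
```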
I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer it to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the best way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also tried the AWS documentation for Lambda and Glue, but I don't know where to find the exact solution to this problem.
All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (e.g. psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
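A minimal sketch of such a Lambda function, here in its S3-triggered form. It assumes psycopg2 is packaged with the function, that the CSV has a header row, and that connection details and the IAM role ARN come from environment variables; all of those names are hypothetical.

```python
from urllib.parse import unquote_plus


def build_copy(table, bucket, key, iam_role):
    """Return a COPY statement that loads one CSV object, skipping its header row."""
    return (
        f"COPY {table} FROM 's3://{bucket}/{key}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1"
    )


def s3_object_from_event(event):
    """Pull the bucket name and (URL-decoded) key out of an S3 event notification."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])


def lambda_handler(event, context):
    import os
    import psycopg2  # imported lazily; must be bundled with the Lambda deployment

    bucket, key = s3_object_from_event(event)
    sql = build_copy(os.environ["TARGET_TABLE"], bucket, key,
                     os.environ["COPY_ROLE_ARN"])
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        # "with conn" commits on success and rolls back on an exception
        with conn, conn.cursor() as cur:
            cur.execute(sql)
    finally:
        conn.close()
```

For the scheduled variant there is no S3 record in the event, so the bucket and key (or a prefix) would have to come from configuration instead.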
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog
I have an AWS Data Pipeline with an EMR Activity, which writes data to S3. At the end of this process, it also writes some metadata to a specific S3 folder in that location.
Is there a way to trigger an AWS Glue crawler from within a Data Pipeline definition, so that it scans this last S3 location and creates an Amazon Athena table?
I haven't found a way to do this in the AWS Data Pipeline documentation.
Maybe you could use ShellCommandActivity and call aws glue start-crawler.
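To make that suggestion concrete, the sketch below generates the pipeline object for such an activity as JSON. The object ids, the crawler name, and the referenced resource and EMR activity ids are hypothetical, and it assumes the resource the command runs on has the AWS CLI installed and permission to start the crawler.

```python
import json


def crawler_activity(crawler_name, runs_on_id, depends_on_id):
    """Build a ShellCommandActivity object that starts a Glue crawler."""
    return {
        "id": "StartGlueCrawler",
        "name": "StartGlueCrawler",
        "type": "ShellCommandActivity",
        "command": f"aws glue start-crawler --name {crawler_name}",
        "runsOn": {"ref": runs_on_id},        # resource that executes the command
        "dependsOn": {"ref": depends_on_id},  # run only after the EMR activity
    }


# Hypothetical ids; paste the printed object into the pipeline's "objects" list.
print(json.dumps(crawler_activity("my-crawler", "MyEc2Resource", "MyEmrActivity"), indent=2))
```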
I have my data in a table in a Redshift cluster. I want to periodically run a query against the Redshift table and store the results in an S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task that runs periodically and unloads the data from Redshift to S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
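As a sketch of the Lambda option, the function below builds an UNLOAD statement and submits it through the Redshift Data API, so no database driver needs to be bundled. The environment variable names, query, S3 prefix, and role ARN are hypothetical, and the function would still need a schedule (for example a CloudWatch Events rule) to invoke it.

```python
def build_unload(query, s3_prefix, iam_role):
    """Wrap a SELECT in an UNLOAD statement that writes CSV files with a header."""
    escaped = query.replace("'", "''")  # single quotes inside the query are doubled
    return (
        f"UNLOAD ('{escaped}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV HEADER ALLOWOVERWRITE"
    )


def lambda_handler(event, context):
    """Submit the UNLOAD asynchronously and return the statement id."""
    import os
    import boto3  # imported lazily so build_unload can be tested without boto3

    sql = build_unload(os.environ["UNLOAD_QUERY"], os.environ["UNLOAD_PREFIX"],
                       os.environ["UNLOAD_ROLE_ARN"])
    response = boto3.client("redshift-data").execute_statement(
        ClusterIdentifier=os.environ["CLUSTER_ID"],
        Database=os.environ["REDSHIFT_DB"],
        DbUser=os.environ["REDSHIFT_USER"],
        Sql=sql,
    )
    return response["Id"]
```

Because the statement runs asynchronously, the function returns before the unload has finished; downstream transformations would need to check for completion separately.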
I believe you are looking for the AWS Data Pipeline service.
You can copy data from Redshift to S3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident that it would solve your use case.