Transform files from one S3 bucket to another

I am new to AWS. My use case is to read ORC files from one S3 bucket, convert them to JSON files, and write them to another S3 bucket.
The volume is about 100 GB and roughly a thousand files every day.
I need to be able to run this on demand or schedule it to run daily. What options should I consider?
Any ideas would be helpful.

Amazon Athena
You can use Amazon Athena to convert file formats via a CREATE TABLE AS SELECT (CTAS) statement. See: Creating a Table from Query Results (CTAS) - Amazon Athena
The question then becomes how to send the commands to Athena. One option is to schedule an AWS Lambda function that starts an Amazon EC2 instance, then run a script on the instance that sends the commands to Amazon Athena. See: Auto-Stop EC2 instances when they finish a task - DEV Community
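As a rough sketch of what such a script might send: the example below assumes the ORC data is already registered as an Athena table and that the target location is empty. The table names, bucket paths, and helper function names are made up for illustration; only start_query_execution is the actual boto3 Athena call.

```python
def build_ctas_sql(source_table, target_table, output_location):
    """Return a CTAS statement that rewrites source_table as JSON files."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (format = 'JSON', external_location = '{output_location}')\n"
        f"AS SELECT * FROM {source_table}"
    )


def start_conversion(athena_client, database, sql, query_results_location):
    """Submit the statement to Athena and return the query execution ID."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": query_results_location},
    )
    return response["QueryExecutionId"]


# Hypothetical usage (names and paths are placeholders):
#   import boto3
#   sql = build_ctas_sql("orc_events", "json_events",
#                        "s3://example-target-bucket/json/")
#   start_conversion(boto3.client("athena"), "example_db", sql,
#                    "s3://example-athena-results/")
```

The call only starts the query; a real script would probably poll get_query_execution until the state is SUCCEEDED or FAILED before moving on.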
AWS Glue ETL job
Alternatively, you could create an AWS Glue ETL job that uses Spark to transform the data. See: Built-In Transforms - AWS Glue
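A minimal sketch of the Spark side of such a job is shown below. It uses the plain Spark DataFrame API rather than Glue's built-in transforms, and the function name and bucket paths are placeholders, so treat it as one possible shape rather than the way a Glue job has to be written.

```python
def convert_orc_to_json(spark, source_path, target_path):
    """Read every ORC file under source_path and write it back as JSON lines."""
    df = spark.read.orc(source_path)
    # "overwrite" replaces whatever is already under target_path.
    df.write.mode("overwrite").json(target_path)


# Inside a Glue ETL script the Spark session is usually obtained like this
# (bucket names below are assumptions):
#   from pyspark.context import SparkContext
#   from awsglue.context import GlueContext
#   spark = GlueContext(SparkContext.getOrCreate()).spark_session
#   convert_orc_to_json(spark,
#                       "s3://example-source-bucket/orc/",
#                       "s3://example-target-bucket/json/")
```

Spark writes one JSON object per line and one output file per partition, so the number of output files will generally not match the number of input files.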

Related

AWS Glue job (PySpark) to AWS Glue Data Catalog

We know that the usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. as CSV), then use a crawler and schedule it.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a more direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.

You may manually specify the table. The crawler only discovers the schema. If you set the schema manually, you should be able to read your data when you run the AWS Glue Job.

We had this same problem for one of our customers who had millions of small files in AWS S3. The crawler would practically stall, never finishing and running indefinitely. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written which used AWS Wrangler to fire queries at Amazon Athena.
2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. The queries fired:
alter table add partition (event_date='<event_date from above>',eventname='List derived from above S3 List output')
4. This was triggered to run after the main ingestion job via Glue Workflows.
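The listing and query-generation steps might look roughly like the sketch below. It assumes Hive-style prefixes of the form event_date=.../eventname=.../ and a table named events; both are guesses for illustration, as are the function names. The input is the CommonPrefixes list that boto3's list_objects_v2 returns when called with Delimiter="/".

```python
def event_names_from_prefixes(common_prefixes):
    """Extract eventname values from S3 'folder' prefixes under one date."""
    names = []
    for entry in common_prefixes:
        last_segment = entry["Prefix"].rstrip("/").split("/")[-1]
        key, _, value = last_segment.partition("=")
        if key == "eventname" and value:
            names.append(value)
    return names


def add_partition_statements(table, event_date, event_names):
    """Build one ALTER TABLE ... ADD PARTITION statement per event name."""
    return [
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (event_date='{event_date}', eventname='{name}')"
        for name in event_names
    ]


# Each statement could then be submitted from the Python Shell job, for
# example with AWS Wrangler (database name is a placeholder):
#   import awswrangler as wr
#   for stmt in add_partition_statements("events", "2021-01-01", names):
#       wr.athena.start_query_execution(sql=stmt, database="example_db")
```

IF NOT EXISTS makes the job safe to re-run for a date whose partitions were already registered.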

If you are not expecting the schema to change, create the tables manually in a Glue database and then use the Glue job directly.

Using AWS Lambda to trigger EMR step to bulk load data to HBase cluster

I'm looking to write an ETL solution that orchestrates copying data from S3 into an AWS EMR cluster running Apache HBase.
The steps I'm looking to write are as follows:
1. A CSV file is uploaded to an S3 bucket.
2. A Lambda function is triggered, moving the file from S3 into the HBase cluster's HDFS.
3. HBase's ImportTsv utility is invoked to bulk load the CSV on HDFS into an HBase table.
I'm new to the world of AWS, so I'm not sure what the best tools are to orchestrate this workflow. How would I go about implementing this?

What is the most optimal way to automate data (csv file) transfer from s3 to Redshift without AWS Pipeline?

I am trying to take SQL data stored in a CSV file in an S3 bucket, transfer it to AWS Redshift, and automate that process. Would writing ETL scripts with Lambda/Glue be the best way to approach this problem, and if so, how do I get the script/transfer to run periodically? If not, what would be the best way to pipeline data from S3 to Redshift?
I tried using AWS Data Pipeline, but it is not available in my region. I also looked through the AWS documentation for Lambda and Glue but don't know where to find the exact solution to the problem.

All systems (including AWS Data Pipeline) use the Amazon Redshift COPY command to load data from Amazon S3.
Therefore, you could write an AWS Lambda function that connects to Redshift and issues the COPY command. You'll need to include a compatible library (e.g. psycopg2) to be able to call Redshift.
You can use Amazon CloudWatch Events to call the Lambda function on a regular schedule. Or, you could get fancy and configure Amazon S3 Events so that, when a file is dropped in an S3 bucket, it automatically triggers the Lambda function.
If you don't want to write it yourself, you could search for existing code on the web, including:
The very simple Python-based christianhxc/aws-lambda-redshift-copy: AWS Lambda function that runs the copy command into Redshift
A more fully-featured node-based A Zero-Administration Amazon Redshift Database Loader | AWS Big Data Blog
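If you do write it yourself, a Lambda function along these lines might be a starting point. This is only a sketch: the environment variable names, the IGNOREHEADER 1 option (which assumes the CSV has a header row), and the helper names are all assumptions, and psycopg2 has to be packaged with the function.

```python
import os
from urllib.parse import unquote_plus


def build_copy_sql(table, s3_uri, iam_role_arn):
    """Return a COPY statement that loads one CSV object into the table."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV IGNOREHEADER 1"
    )


def s3_uri_from_event(event):
    """Build the s3:// URI of the object that triggered an S3 event."""
    record = event["Records"][0]["s3"]
    key = unquote_plus(record["object"]["key"])  # keys arrive URL-encoded
    return f"s3://{record['bucket']['name']}/{key}"


def lambda_handler(event, context):
    import psycopg2  # imported here so the helpers work without it installed

    sql = build_copy_sql(
        os.environ["TARGET_TABLE"],       # hypothetical variable names
        s3_uri_from_event(event),
        os.environ["COPY_ROLE_ARN"],
    )
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=int(os.environ.get("REDSHIFT_PORT", "5439")),
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            cur.execute(sql)
        conn.commit()
    finally:
        conn.close()
```

This handler expects to be triggered by an S3 event; for a scheduled run, the S3 URI would have to come from configuration instead of the event.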

AWS Data Pipeline trigger AWS Glue crawler

I have an AWS Data Pipeline with an EMR activity, which writes data to S3. At the end of this process, it also writes some metadata to a specific S3 folder in that location.
Is there a way to trigger an AWS Glue crawler from within a Data Pipeline definition, so that it scans this last S3 location and creates an Amazon Athena table?
I haven't found a way to do this in the AWS Data Pipeline documentation.

Maybe you could use ShellCommandActivity and call aws glue start-crawler.
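If the shell command ends up running a script rather than the CLI directly, the boto3 equivalent might look like the sketch below. The function name and crawler name are placeholders; get_crawler and start_crawler are the boto3 Glue client calls.

```python
def start_crawler_if_ready(glue_client, crawler_name):
    """Start the crawler unless it is already running; return True if started."""
    state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
    if state != "READY":
        # RUNNING or STOPPING: starting again would raise an error.
        return False
    glue_client.start_crawler(Name=crawler_name)
    return True


# Hypothetical usage:
#   import boto3
#   start_crawler_if_ready(boto3.client("glue"), "example-metadata-crawler")
```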

Periodically moving query results from Redshift to S3 bucket

I have my data in a table in a Redshift cluster. I want to periodically run a query against the Redshift table and store the results in an S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.

You can use a scheduling tool like Airflow to accomplish this task. Airflow connects seamlessly to Redshift and S3. You can have a DAG task that queries Redshift periodically and unloads the data from Redshift onto S3.

I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
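For the Lambda route, one possible sketch is below. It submits the UNLOAD through the Redshift Data API (the boto3 "redshift-data" client), which is my assumption about how the function connects; a driver such as psycopg2 would work as well. The helper names, cluster identifier, and paths are placeholders.

```python
def build_unload_sql(query, s3_prefix, iam_role_arn):
    """Wrap a SELECT in an UNLOAD statement, doubling any single quotes."""
    escaped = query.replace("'", "''")
    return f"UNLOAD ('{escaped}') TO '{s3_prefix}' IAM_ROLE '{iam_role_arn}'"


def run_unload(redshift_data_client, cluster_id, database, db_user, sql):
    """Submit the statement asynchronously and return the statement ID."""
    response = redshift_data_client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql,
    )
    return response["Id"]


# Hypothetical usage inside a scheduled Lambda handler:
#   import boto3
#   sql = build_unload_sql("select * from sales where region = 'EU'",
#                          "s3://example-bucket/exports/sales_",
#                          "arn:aws:iam::123456789012:role/example-unload")
#   run_unload(boto3.client("redshift-data"), "example-cluster",
#              "dev", "example_user", sql)
```

The Data API call returns as soon as the statement is queued, so the function does not have to stay running while the UNLOAD executes.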

I believe you are looking for the AWS Data Pipeline service.
You can copy data from Redshift to S3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.

You should try AWS Data Pipeline. You can schedule pipelines to run periodically or on demand. I am confident that it would solve your use case.
