AWS glue job (Pyspark) to AWS glue data catalog - amazon-web-services

We know that,
the procedure of writing from pyspark script (aws glue job) to AWS data catalog is to write in s3 bucket (eg.csv) use a crawler and schedule it.
Is there any other way of writing to aws glue data catalog?
I am looking for a direct way to do this.Eg. writing as a s3 file and sync to the aws glue data catalog.

You may manually specify the table. The crawler only discovers the schema. If you set the schema manually, you should be able to read your data when you run the AWS Glue Job.

We have had this same problem for one of our customers who had millions of small files within AWS S3. The crawler practically would stall and not proceed and continue to run infinitely. We came up with the following alternative approach :
A Custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries towards AWS Athena.
The Python Shell job would List the contents of folder s3:///event_date=<Put the Date Here from #2.1>
The queries fired :
alter table add partition (event_date='<event_date from above>',eventname=’List derived from above S3 List output’)
4. This was triggered to run post the main Ingestion Job via Glue Workflows.

If you are not expecting schema to change, use Glue job directly after creating manually tables using Glue Database and Table.

Related

Batch file processing in AWS using Data Pipeline

I have a requirement of reading a csv batch file that was uploaded to s3 bucket, encrypt data in some columns and persist this data in a Dynamo DB table. While persisting each row in the DynamoDB table, depending on the data in each row, I need to generate an ID and store that in the DynamoDB table too. It seems AWS Data pipeline allows to create a job to import S3 bucket files into DynanoDB, but I can't find a way to add a custom logic there to encrypt some of the column values in the file and add custom logic to generate the id mentioned above.
Is there any way that I can achieve this requirement using AWS Data Pipeline? If not what would the best approach that I can follow using AWS services?
We also have a situation where we need fetch data from S3 and populate it to DynamoDb after performing some transformations (business logic).
We also use AWS DataPipeline for this process.
We first trigger a EMR cluster from Data Pipeline where we fetch the data from S3 and then transform it and populate the DynamoDB(DDB). You can include all the logic you require in the EMR cluster.
We have a timer set in the pipeline which triggers the EMR cluster every day once to perform the task.
This can be having additional costs too.

AWS Glue ETL : transfer data to S3 Bucket

I wish to transfer data in a database like MySQL[RDS] to S3 using AWS Glue ETL.
I am having difficulty trying to do this the documentation is really not good.
I found this link here on stackoverflow:
Could we use AWS Glue just copy a file from one S3 folder to another S3 folder?
SO based on this link, it seems that Glue does not have an S3 bucket as a data Destination, it may have it as a data Source.
SO, i hope i am wrong on this.
BUT if one makes an ETL tool, one of the first basics on AWS is for it to tranfer data to and from an S3 bucket, the major form of storage on AWS.
So hope someone can help on this.
You can add a Glue connection to your RDS instance and then use the Spark ETL script to write the data to S3.
You'll have to first crawl the database table using Glue Crawler. This will create a table in the Data Catalog which can be used in the job to transfer the data to S3. If you do not wish to perform any transformation, you may directly use the UI steps for autogenerated ETL scripts.
I have also written a blog on how to Migrate Relational Databases to Amazon S3 using AWS Glue. Let me know if it addresses your query.
https://ujjwalbhardwaj.me/post/migrate-relational-databases-to-amazon-s3-using-aws-glue
Have you tried https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-copyrdstos3.html?
You can use AWS Data Pipeline - it has standard templates for full as well incrementation copy to s3 from RDS.

Pipelining Athena query after ETL from glue crawler

I have data that is coming into an S3 bucket and I would like to run a query on it every hour. The data comes in as a JSON. I crawl it, run a job on the data to transform it to ORC format, and crawl it again to create a table that's faster for queries than the original JSONs (as they are deeply nested). I'm trying to query the data with Athena. I have managed to link the previous steps together using Lambda and cloudwatch events.
The problem here is that the last crawler is supposed to create new tables instead of just partitions of the same table, so the table name is not known prior to running the list of jobs. I found that you can listen for the creation of a new table and the completion of a crawler, but the log for the end of a crawler's run doesn't contain the name of the new table created (using Amazon's Documentation). Is there a way to get this table name dynamically and query it using Lambda or Athena? Thanks
Why not invoke lambda from glue job after crawler completes? Table name is folder in S3 bucket in which you stored orc data. Since it is done in glue job, I believe you already have folder name which you can pass to lambda from glue job.

Can Amazon Athena be used to query a dynamic schema?

I have a service running that populates my S3 bucket with the compressed log files, but the log files do not have a fixed schema and athena expects a fixed schema. (Which I wrote while creating the table)
So my question is as in the title, is there any way around through which I can query a dynamic schema? If not is there any other service like athena to do the same thing?
Amazon Athena can't do that by itself, but you can configure an AWS Glue crawler to automatically infer the schema of your JSON files. The crawler can run on a schedule, so your files will be indexed automatically even if the schema changes. Athena will use the Glue data catalog if AWS Glue is available in the region you're running Athena in.
See Cataloging Tables with a Crawler in the AWS Glue docs for the details on how to set that up.

Periodically moving query results from Redshift to S3 bucket

I have my data in a table in Redshift cluster. I want to periodically run a query against the Redshift table and store the results in a S3 bucket.
I will be running some data transformations on this data in the S3 bucket to feed into another system. As per AWS documentation I can use the UNLOAD command, but is there a way to schedule this periodically? I have searched a lot but I haven't found any relevant information around this.
You can use a scheduling tool like Airflow to accomplish this task. Airflow seem-lessly connects to Redshift and S3. You can have a DAG action, which polls Redshift periodically and unloads the data from Redshift onto S3.
I don't believe Redshift has the ability to schedule queries periodically. You would need to use another service for this. You could use a Lambda function, or you could schedule a cron job on an EC2 instance.
I believe you are looking for AWS data pipeline service.
You can copy data from redshift to s3 using the RedshiftCopyActivity (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html).
I am copying the relevant content from the above URL for future purposes:
"You can also copy from Amazon Redshift to Amazon S3 using RedshiftCopyActivity. For more information, see S3DataNode.
You can use SqlActivity to perform SQL queries on the data that you've loaded into Amazon Redshift."
Let me know if this helped.
You should try AWS Data Pipelines. You can schedule them to run periodically or on demand. I am confident that it would solve your use case