How to read the traceId from an AWS Glue job? - amazon-web-services

Is it possible to read the X-Ray trace ID from an AWS Glue job? In Lambda it can be fetched from an environment variable, but I'm not sure about Glue.
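In Lambda the trace ID is exposed through the `_X_AMZN_TRACE_ID` environment variable; whether a Glue job sets anything comparable is not confirmed here. The snippet below is only a minimal probe you could drop into a Glue script to check, not a documented Glue feature.

```python
import os

# Minimal probe: look for anything trace-related in the job's environment.
# _X_AMZN_TRACE_ID is the variable Lambda uses; Glue may or may not set it.
candidates = {k: v for k, v in os.environ.items() if "TRACE" in k.upper()}

trace_id = os.environ.get("_X_AMZN_TRACE_ID")
if trace_id:
    print(f"Found X-Ray trace ID: {trace_id}")
else:
    print(f"No _X_AMZN_TRACE_ID set; trace-like variables found: {candidates}")
```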

Related

AWS Glue job (PySpark) to AWS Glue Data Catalog

We know that the usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write the output to an S3 bucket (e.g. CSV), then use a crawler and schedule it.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.
You may manually specify the table. The crawler only discovers the schema; if you define the schema manually, you should be able to read your data when you run the AWS Glue job.
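As a rough illustration of defining the table manually instead of crawling, here is a minimal boto3 sketch. The database name, table name, S3 path, and column list are hypothetical placeholders.

```python
import boto3

glue = boto3.client("glue")

# Manually register a CSV table in the Glue Data Catalog (no crawler involved).
# All names and paths below are placeholders.
glue.create_table(
    DatabaseName="my_database",
    TableInput={
        "Name": "my_table",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "bigint"},
                {"Name": "event_date", "Type": "string"},
            ],
            "Location": "s3://my-bucket/my-prefix/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```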
We had this same problem for one of our customers, who had millions of small files in AWS S3. The crawler would practically stall and keep running indefinitely. We came up with the following alternative approach (see the sketch after this list):
1. A custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries towards AWS Athena.
2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. The queries fired: alter table add partition (event_date='<event_date from above>', eventname='List derived from above S3 List output')
4. This was triggered to run after the main ingestion job via Glue Workflows.
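A minimal sketch of what such a Python Shell job might look like, assuming the awswrangler library and placeholder bucket, database, and table names:

```python
import awswrangler as wr

# Placeholders: bucket, prefix, database and table names are assumptions.
EVENT_DATE = "2023-01-01"
PREFIX = f"s3://my-bucket/event_date={EVENT_DATE}/"

# 1. List the event-name "folders" under the date prefix.
paths = wr.s3.list_directories(PREFIX)
event_names = [p.rstrip("/").split("=")[-1] for p in paths if "eventname=" in p]

# 2. Add one partition per event name via Athena instead of crawling.
for event_name in event_names:
    sql = (
        "ALTER TABLE my_table ADD IF NOT EXISTS "
        f"PARTITION (event_date='{EVENT_DATE}', eventname='{event_name}')"
    )
    wr.athena.start_query_execution(sql, database="my_database", wait=True)
```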
If you are not expecting the schema to change, create the database and tables manually in the Glue Data Catalog and use the Glue job directly.

Can we use AWS Glue for analysing the RDS database and storing the analysed data into an RDS MySQL table using ETL?

I am new to AWS and want to use AWS Glue for the ETL process.
Can we use AWS Glue for analysing the RDS database and storing the analysed data into an RDS MySQL table using an ETL job?
Thanks
Yes, it's possible. We have used S3 to store our raw data, from where we read the data in AWS Glue and perform UPSERTs to RDS Aurora as part of our ETL process. You can use either an AWS Glue trigger or a Lambda S3 event trigger to call the Glue job.
We have used pymysql / mysql.connector in AWS Glue since we have to do UPSERTs. Bulk loading data directly from S3 is also supported for RDS MySQL (Aurora). Let me know if you need help with a code sample.
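A minimal sketch of the UPSERT pattern described above, assuming pymysql is available in the Glue job and using placeholder connection details, table, and columns:

```python
import pymysql

# Placeholder connection details and schema: adjust for your RDS instance.
conn = pymysql.connect(
    host="my-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="secret",  # in practice, fetch from Secrets Manager
    database="analytics",
)

# Rows to upsert, e.g. collected from a small Spark DataFrame.
rows = [(1, "2023-01-01", 42.0), (2, "2023-01-01", 17.5)]

upsert_sql = """
    INSERT INTO daily_metrics (id, event_date, value)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE value = VALUES(value)
"""

with conn.cursor() as cur:
    cur.executemany(upsert_sql, rows)
conn.commit()
conn.close()
```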

AWS Glue Data Catalog, temporary tables and Apache Spark createOrReplaceTempView

According to the AWS Glue Data Catalog documentation https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
Temporary tables are not supported.
It is not clear to me whether "temporary tables" also covers the temporary views that can be created in Apache Spark via the DataFrame.createOrReplaceTempView method.
In other words, am I right that I can't use the DataFrame.createOrReplaceTempView method with AWS Glue and the AWS Glue Data Catalog? Can I currently only operate with permanent tables/views in AWS Glue and the AWS Glue Data Catalog, and must I use an AWS EMR cluster for full-featured Apache Spark functionality?
You can use DataFrame.createOrReplaceTempView() in AWS Glue. You have to convert the DynamicFrame to a DataFrame using toDF().
However, these views remain in the scope of your current Glue job instance and won't be accessible from other Glue jobs, other instances of the same job, or Athena.
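A minimal sketch of that pattern inside a Glue PySpark job; the catalog database and table names are placeholders:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Read a catalog table as a DynamicFrame, then convert to a DataFrame.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"  # placeholder names
)
df = dyf.toDF()

# The temp view only exists inside this job run's Spark session.
df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
```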

AWS Data Pipeline trigger AWS Glue crawler

I have an AWS Data Pipeline with an EMR activity, which writes data to S3. At the end of this process, it also writes some metadata to a specific S3 folder in that location.
Is there a way to trigger an AWS Glue crawler from within a Data Pipeline definition, one which scans this last S3 location so that it creates an AWS Athena table?
I haven't found a way to do this in the AWS Data Pipeline documentation.
Maybe you could use a ShellCommandActivity and call aws glue start-crawler.
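One way to flesh out that suggestion is to have the ShellCommandActivity run a small script. The sketch below uses boto3 (equivalent to the aws glue start-crawler CLI call) with a hypothetical crawler name:

```python
import time
import boto3

glue = boto3.client("glue")
CRAWLER_NAME = "metadata-crawler"  # placeholder crawler name

# Start the crawler (same effect as `aws glue start-crawler --name ...`).
glue.start_crawler(Name=CRAWLER_NAME)

# Optionally wait until the crawl finishes before the activity exits.
while True:
    state = glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"]
    if state == "READY":
        break
    time.sleep(30)
```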

Variable to provide an iteration number for a Hive query reading from a Kinesis stream in an AWS Data Pipeline

I am trying to create an AWS Data Pipeline that executes a Hive query whose output is written to an S3 bucket. The data is then moved from the S3 bucket into an AWS Redshift cluster.
The Hive query uses a Kinesis stream as its input. I'm trying to leverage Kinesis' checkpointing capability and want to pass a variable to my ShellCommandActivity so I can set the iteration number in the Hive script. Is there any way for me to have a variable that increments by 1 every time the pipeline runs?
Any assistance would be great!
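This question goes unanswered above; one possible approach (an assumption, not something confirmed in the thread) is to keep an atomic counter in DynamoDB and have the ShellCommandActivity fetch the next iteration number at the start of each run. The table and attribute names below are hypothetical:

```python
import boto3

dynamodb = boto3.client("dynamodb")

def next_iteration(table="pipeline_counters", pipeline_id="kinesis-hive-pipeline"):
    """Atomically increment and return the run counter for this pipeline."""
    resp = dynamodb.update_item(
        TableName=table,                               # hypothetical table
        Key={"pipeline_id": {"S": pipeline_id}},
        UpdateExpression="ADD iteration :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    return int(resp["Attributes"]["iteration"]["N"])

if __name__ == "__main__":
    # The ShellCommandActivity could capture this value and pass it
    # to the Hive script as the Kinesis iteration number.
    print(next_iteration())
```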