AWS Glue triggering job - amazon-web-services

I have modified a Glue-generated script that I use for transforming and manipulating my data. I want to trigger the same job for every new table that appears in the catalog, without manually changing the table name in the job script.
In short: how can I run the transformation the script provides on every new table that appears in the Data Catalog, without manually changing the table name every time?
Thanks

You can use the catalog client to dynamically get the list of tables in the database. I don't know how to get the catalog client in PySpark, but in Scala it looks like this:
import scala.collection.JavaConverters._

val catalog = glueContext.getCatalogClient
for (table <- catalog.listTables("myDatabaseName", "").getTableList.asScala) {
  // run your transformation here, e.g. using table.getName
}
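In PySpark there is no obvious equivalent on glueContext, but the same list can be fetched with boto3. A minimal sketch, where the database name and the per-table work are placeholders:

import boto3

# Sketch: enumerate every table in a Glue database and run the same
# transformation on each one. "myDatabaseName" is a placeholder.
glue = boto3.client("glue")
paginator = glue.get_paginator("get_tables")

for page in paginator.paginate(DatabaseName="myDatabaseName"):
    for table in page["TableList"]:
        # do your transformation, e.g. build a DynamicFrame per table:
        # dyf = glueContext.create_dynamic_frame.from_catalog(
        #     database="myDatabaseName", table_name=table["Name"])
        print(table["Name"])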

Related

AWS Glue Crawler is not updating the table after the 1st crawl

I am adding a new file in Parquet format, created by a Glue DataBrew job, to my S3 folder. The new file has the same schema as the previous file, but when I run the crawler for the 2nd time it neither updates the table nor creates a new one in the Data Catalog. When I crawl both files together, though, both of them get added.
The log file gives the following information:
INFO : Created partitions with values [[New file name]] for table
BENCHMARK : Finished writing to Catalog
I have tried with and without "Create a single schema for each S3 path", but the crawler is not updating the table with the new file. Soon I will be adding new files on a daily basis for my analysis. Any solution?
In my opinion, the best way to approach this issue is to have AWS DataBrew output to the Data Catalog directly. The Data Catalog can be updated either by the crawler or by DataBrew directly, but the recommended practice is to employ only one of those mechanisms, not both.
Can you try running the job with your Data Catalog as the output and let DataBrew manage your catalog? It should update your catalog table with the right data/files.
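If it helps, something along these lines should set that up programmatically. This is a hedged sketch rather than a verified configuration, and every name, ARN, and location in it is a placeholder:

import boto3

# Hedged sketch: create a DataBrew recipe job whose output is a Data
# Catalog table, so DataBrew maintains the table itself. All names,
# the role ARN, and the S3 location are placeholders.
databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="my-brew-job",                                   # hypothetical
    DatasetName="my-dataset",                             # hypothetical
    RoleArn="arn:aws:iam::123456789012:role/DataBrewRole",
    RecipeReference={"Name": "my-recipe"},
    DataCatalogOutputs=[{
        "DatabaseName": "my_db",
        "TableName": "my_table",
        # where the job writes the files that back the catalog table
        "S3Options": {"Location": {"Bucket": "my-bucket", "Key": "output/"}},
    }],
)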

Automate loading data from S3 to Redshift

I want to load data from S3 to Redshift. The data arrives in S3 at around 5 MB (approximate size) per second.
I need to automate the loading of data from S3 to Redshift.
The data is dumped into S3 by a Kafka-stream consumer application.
The S3 data is organized in a folder structure.
Example folder:
bucketName/abc-event/2020/9/15/10
Files in this folder:
abc-event-2020-9-15-10-00-01-abxwdhf. 5MB
abc-event-2020-9-15-10-00-02-aasdljc. 5MB
abc-event-2020-9-15-10-00-03-thntsfv. 5MB
The files in S3 contain JSON objects separated by newlines.
This data needs to be loaded into the abc-event table in Redshift.
I know of a few options, such as AWS Data Pipeline, AWS Glue, and the AWS Lambda Redshift loader (https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/).
What would be the best way to do it?
I would really appreciate it if someone could guide me.
Thank you
=============================================
Thanks Prabhakar for the answer. I need some help in continuation of this.
I created a table in the Data Catalog using a crawler, and running an ETL job in Glue then does the job of loading the data from S3 into Redshift.
I am using approach 1, predicate pushdown.
New files get loaded into S3 in a different partition (say a new hour started).
I am adding the new partition using an AWS Glue Python script job, which adds the partition to the table via the Athena API (using ALTER TABLE ADD PARTITION).
I have checked in the console that the new partition gets added to the Data Catalog table by the Python script job.
But when I run the same job with a pushdown predicate specifying the same partition that the Python script job added, the job does not load the new files from S3 in this new partition into Redshift.
I can't figure out what I am doing wrong.
In your use case you can leverage AWS Glue to load the data into Redshift periodically. You can schedule your Glue job with a trigger to run every 60 minutes, which at the stated rate of ~5 MB per second works out to around 18 GB per run.
This interval can be changed according to your needs, depending on how much data you want to process on each run.
There are a couple of approaches you can follow for reading this data:
Predicate pushdown:
This will load only the partitions mentioned in the job. You can calculate the partition values on the fly on every run and pass them to the filter. For this you need to run the Glue crawler on each run so that the table partitions are updated in the table metadata.
If you don't want to use a crawler, then you can either use boto3 create_partition or an Athena ADD PARTITION, which is a free operation (see the sketch after this list).
Job bookmark:
This will load only the latest S3 data that has accumulated since your Glue job completed its previous run. This approach might not be effective if no data is generated in S3 between some runs.
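If you go the boto3 route, a minimal create_partition sketch could look like the following. The database, table, S3 path, and partition values are placeholders, and it copies the table's storage descriptor so the new partition inherits the same format settings:

import boto3

# Sketch: register a new hourly partition in the Data Catalog without
# running a crawler. Names, path and partition values are placeholders.
glue = boto3.client("glue")

# Reuse the table's storage descriptor, overriding only the location.
table = glue.get_table(DatabaseName="my_db", Name="abc_event")["Table"]
storage = dict(table["StorageDescriptor"],
               Location="s3://bucketName/abc-event/2020/9/15/11/")

glue.create_partition(
    DatabaseName="my_db",
    TableName="abc_event",
    PartitionInput={
        "Values": ["2020", "9", "15", "11"],  # year, month, day, hour
        "StorageDescriptor": storage,
    },
)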
Once you have worked out the data that is to be read, you can simply write it to the Redshift table on every run (a sketch of the write follows the example below).
In your case the files are present in subdirectories, so you need to enable recurse as shown in the statement below.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=<name>,
    table_name=<name>,
    push_down_predicate="(year=='2019' and month=='06')",
    transformation_ctx="datasource0",
    additional_options={"recurse": True})
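And a minimal sketch of the write itself, assuming a Glue connection to your cluster already exists; the connection name, database/table names, and temp dir are placeholders:

# Sketch: write the frame read above into the abc-event Redshift table.
# "redshift-connection", "dev"/"abc_event" and the temp dir are
# placeholders for your own setup.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=datasource0,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "abc_event", "database": "dev"},
    redshift_tmp_dir="s3://my-bucket/temp/",  # staging area for the COPY
    transformation_ctx="write_redshift")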

How data is retrieved via metadata-created tables in a Glue script

In AWS Glue, although I have read the documentation, there is one thing I did not get cleared up. Below is what I understood.
Regarding crawlers: a crawler will create a metadata table for either S3 or a DynamoDB table. But what I don't understand is: how is a Scala/Python script able to retrieve data from the actual source (say DynamoDB or S3) using the metadata tables the crawler created?
val input = glueContext
  .getCatalogSource(database = "my_data_base", tableName = "my_table")
  .getDynamicFrame()
Does the above line retrieve data from the actual source via the metadata tables?
I would be glad if someone could explain what happens behind the scenes when retrieving data in a Glue script via metadata tables.
When you run a Glue crawler it fetches metadata from S3 or JDBC (depending on your requirement) and creates tables in the AWS Glue Data Catalog.
Now if you want to connect to this data/these tables from a Glue ETL job, you can do it in multiple ways depending on your requirement:
[from_options][1]: if you want to load directly from S3/JDBC without connecting to the Glue catalog (a sketch follows below).
[from_catalog][1]: if you want to load data from the Glue catalog, then you need to link to it using the getCatalogSource method as shown in your code. As the name implies, it uses the Glue Data Catalog as the source and loads the particular table that you pass to this method.
Once it has looked at your table definition, which points to a location, it makes a connection to that location and loads the data present in the source.
Yes, you need to use getCatalogSource if you want to load tables from the Glue catalog.
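For contrast, a minimal from_options sketch that reads the same kind of data straight from S3 with no catalog involved; the path and format are placeholders:

# Sketch: load JSON directly from S3 with from_options, bypassing the
# Glue Data Catalog entirely. The path and format are placeholders.
direct = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"]},
    format="json")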
Does the catalog look at the crawler and refer to the actual source to load data?
Check out the diagram in this [link][2]. It will give you an idea of the flow.
What if the crawler is deleted before I run getCatalogSource? Will I still be able to load the data in that case?
The crawler and the table are two different components, so it all depends on when the table is deleted. If you delete the table after your job has started executing, there will not be any problem. If you delete it before execution starts, you will encounter an error.
What if my source has many millions of records? Will this load all the records in that case?
It is good to have large files present in the source, as this avoids most of the small-files problem. Glue is based on Spark, which reads the data in partitions that fit in memory and then performs the computations. Check this [answer][3] and [this][4] for best practices when reading larger files in AWS Glue.
[1]: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html
[2]: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
[3]: https://stackoverflow.com/questions/46638901/how-spark-read-a-large-file-petabyte-when-file-can-not-be-fit-in-sparks-main
[4]: https://aws.amazon.com/blogs/big-data/optimize-memory-management-in-aws-glue/#:~:text=Incremental%20processing:%20Processing%20large%20datasets

Create tables in Glue Data Catalog for data in S3 and unknown schema

My current use case is an ETL-based service (note: the ETL service is not using Glue ETL; it is an independent service) in which I extract some data from AWS Redshift clusters into S3. The data in S3 is then fed into the T and L jobs. I want to populate the metadata into the Glue Catalog. The most basic solution for this is to use the Glue crawler, but the crawler runs for approximately 1 hour and 20 minutes (there are a lot of S3 partitions). The other solution that I came across is to use the Glue APIs; however, I am facing the issue of defining the data types there.
Is there any way I can create/update the Glue Catalog tables when I have data in S3 and the data types are known only during the extraction process?
Also, when the T and L jobs are run, the data types should be readily available in the catalog.
In order to create and update the Data Catalog during your ETL process, you can make use of the following:
Update:
additionalOptions = {"enableUpdateCatalog": True,
                     "updateBehavior": "UPDATE_IN_DATABASE"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]

sink = glueContext.write_dynamic_frame_from_catalog(
    frame=last_transform, database=<dst_db_name>,
    table_name=<dst_tbl_name>, transformation_ctx="write_sink",
    additional_options=additionalOptions)
job.commit()
The above can be used to update the schema. You also have the option to set updateBehavior, choosing between LOG and UPDATE_IN_DATABASE (the default).
Create:
To create new tables in the Data Catalog during your ETL, you can follow this example:
sink = glueContext.getSink(
    connection_type="s3", path="s3://path/to/data",
    enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)
You can specify the database and the new table name using setCatalogInfo.
You also have the option to update the partitions in the Data Catalog by using the enableUpdateCatalog argument and then specifying the partitionKeys.
A more detailed explanation of the functionality can be found here.
I found a solution to the problem: I ended up utilising the Glue Catalog APIs to make it seamless and fast.
I created an interface that interacts with the Glue Catalog and overrode its methods for the various data sources. Right after the data has been loaded into S3, I fire a query to get the schema from the source, and then the interface does its work.
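For anyone taking the same route, a rough boto3 sketch of that idea follows. Every name, column, and format below is a placeholder, and in the real interface the column list would come from the schema discovered at extraction time:

import boto3

# Sketch: create or update a catalog table directly once the schema is
# known from the extraction step. All names and columns are placeholders.
glue = boto3.client("glue")

table_input = {
    "Name": "my_table",
    "TableType": "EXTERNAL_TABLE",
    "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    "StorageDescriptor": {
        "Columns": [{"Name": "id", "Type": "bigint"},
                    {"Name": "payload", "Type": "string"}],
        "Location": "s3://my-bucket/my-prefix/",
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
    },
}

try:
    glue.create_table(DatabaseName="my_db", TableInput=table_input)
except glue.exceptions.AlreadyExistsException:
    glue.update_table(DatabaseName="my_db", TableInput=table_input)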

AWS Glue: Do I really need a Crawler for new content?

What I understand from the AWS Glue docs is that a crawler will help crawl and discover new data. However, I noticed that once I had crawled once, if new data went into S3, the data was already discovered when I queried the Data Catalog, from Athena for example. So can I say that I do not need the crawler to crawl every time new data is added, unless there are new schemas?
In fact, if I know the schema of the files, I can just create the table manually and do without a crawler. Am I correct?
If the data is partitioned by some keys (i.e. placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register the newly added partitions (e.g. /day=3) in the Data Catalog to be able to query them via Athena.
However, if the data is not partitioned or arrives in already registered partitions, then there is no need to run a crawler.
As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or by registering them manually (a sketch follows below).
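If you'd rather script that than type it into the Athena console, a small sketch; the database, table, and results location are placeholders:

import boto3

# Sketch: kick off MSCK REPAIR TABLE through the Athena API. The table
# name and the query-results location are placeholders.
athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)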
The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE Athena query, or to fill in all the fields via the AWS Glue console, then you can go that way as well.
If you have the schema, then you don't need to use the crawler, and you might get better results (the crawler assumes partition columns are strings, for example).
As Yuriy says, remember to run MSCK REPAIR TABLE or register new partitions manually.
MSCK can time out if you've added a lot of partitions; if it does, keep running it until it completes normally.