I'm new to using AWS Glue and I don't understand how the ETL job gathers the data. I used a crawler to generate my table schema from some files in an S3 bucket and examined the autogenerated script in the ETL job, which is here (slightly modified):
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("data", "string", "data", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
When I run this job, it successfully takes my data from the bucket that my crawler used to generate the table schema and it puts the data into my destination s3 bucket as expected.
My question is this: I don't see anywhere in this script where the data is "loaded", so to speak. I know I point it at the table that was generated by the crawler, but from this doc:
Tables and databases in AWS Glue are objects in the AWS Glue Data Catalog. They contain metadata; they don't contain data from a data store.
If the table only contains metadata, how are the files from the data store (in my case, an S3 bucket) retrieved by the ETL job? I'm asking primarily because I'd like to somehow modify the ETL job to transform identically structured files in a different bucket without having to write a new crawler, but also because I'd like to strengthen my general understanding of the Glue service.
The main thing to understand is:
The Glue Data Catalog (databases and tables) is always in sync with Athena, a serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. You can create tables and databases either from the Glue console or from the Athena query console.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
The line above does the magic for you: it creates the initial DynamicFrame from the Glue Data Catalog source table. Besides the metadata, schema, and table properties, the catalog entry also holds the Location pointing to your data store (the S3 path where your data actually resides), which is how the job knows where to read the files from.
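Since the catalog entry only supplies the schema plus the S3 location, you can also bypass the catalog entirely and read identically structured files from a different bucket with `from_options`, with no new crawler required. A minimal sketch, where the path and the JSON format are assumptions (your files may need a different `format`):

```python
def read_other_bucket(glue_context, path="s3://my-other-bucket/data/"):
    """Build a DynamicFrame straight from an S3 path, skipping the Data Catalog.

    The default path is a placeholder, and the files are assumed to be JSON.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path]},
        format="json",
        transformation_ctx="other_source",
    )
```

The resulting frame can feed the same ApplyMapping step unchanged, since the files share the original structure.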
After the ApplyMapping step has run, this portion of code (the datasink) does the actual loading of the data into your target data store (here, the output S3 bucket).
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
If you drill down into the AWS Glue Data Catalog, you'll find tables residing under databases. Clicking on one of these tables exposes its metadata, which shows the S3 folder that the table points to as a result of the crawler run.
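The same Location can be read programmatically. As a sketch, using the Glue API via boto3 (the database and table names would be your own):

```python
def table_location(get_table_response):
    """Extract the S3 path from a Glue GetTable response."""
    return get_table_response["Table"]["StorageDescriptor"]["Location"]

def fetch_table_location(database, name):
    """Look up where a catalog table's data actually lives in S3."""
    import boto3  # imported lazily; only needed when actually calling AWS
    glue = boto3.client("glue")
    return table_location(glue.get_table(DatabaseName=database, Name=name))
```

For the table in the question, `fetch_table_location("mydatabase", "mytablename")` would return the S3 path the crawler recorded.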
You can also create tables over structured S3 files manually, by adding tables via the Data Catalog option and pointing them at your S3 location.
Another way is to use the Athena console to create tables pointing at S3 locations. You would use a regular CREATE TABLE script with the LOCATION field holding your S3 path.
Related
I have created a Glue ETL job that loads data from a file at an S3 location into a Redshift table. However, I need to load the data into a particular Redshift schema. Below is sample code from my Glue script:
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"withHeader": True, "separator": "~"},
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://<bucketName>/{0}".format(file_name_temp)]},
    transformation_ctx="S3bucket_node1",
)
print('{0} completed'.format(file_name))
print(S3bucket_node1)
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = S3bucket_node1, catalog_connection = "redshift", connection_options = {"dbtable": "Persons", "database": "dev"}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "datasink4")
With the above script, it always loads the data into the public schema. When I try to set dbtable to "Schema.tableName", it gives an error. I also tried adding a new "Schema" property in connection_options, but that did not work either. Can someone please help me with how to specify the schema name so the data is loaded into that particular schema only? I know this can be done by creating a crawler, but in my project we are restricted from creating crawlers.
TL;DR
I'm trying to consolidate many S3 data files into a smaller number using a Glue [Studio] job
Input data is Catalogued in Glue and queryable via Athena
Glue Job runs with "Succeeded" output status, but no output files are created
Details
Input I have data that's being created by a scraper on a once-per-minute cycle. It dumps its output in JSON (gzip) format to a bucket. I have this bucket catalogued in Glue and can query against it, with no errors, using Athena. This makes me more confident that I have the catalogue and data structure set up correctly. Alone, though, this isn't ideal, as it creates ~1.4K files per day, which makes queries against the data (via Athena) quite slow, since they have to scan far too many, far too small files.
Goal I'd like to periodically (probably once per week or month, I'm not sure yet) consolidate the once-per-minute files into far fewer, so that queries scan bigger, less numerous files (faster queries).
Approach My plan is to create a Glue ETL job (using Glue Studio) to read from the catalogue table and write to a new S3 location (maintaining the same JSON-gzip format, so I can simply re-point the Glue table to the new S3 location with the consolidated files). I set up the job using Glue Studio, and when I run it, it says it succeeded, but there's no output at the S3 location specified (not empty files, just nothing at all).
Stuck! I'm at a bit of a loss, since (1) it says it's succeeding, and (2) I'm not even modifying the script (see below), so I'd presume (maybe a bad idea) that the script isn't the problem.
Logs I've tried going through the CloudWatch logs to see if it'll help, but I don't get much out of there. I suspect it may have something to do with this entry, but I can't find a way to either confirm that or change anything to "fix" it. (The path definitely exists, verified by the fact that I can see it in S3, the Catalogue can search it as verified by Athena queries, and it's auto-generated by the Glue Studio script-builder.) To me it sounds like I've selected, somewhere, an option that makes it think I only want some sort of "incremental" scan of the data. But I haven't (knowingly), nor can I find anywhere that would make it seem I have.
CloudWatch Log Entry
21/03/13 17:59:39 WARN HadoopDataSource: Skipping Partition {} as no new files detected # s3://my_bucket/my_folder/my_source_data/ or path does not exist
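For what it's worth, that WARN line is commonly associated with job bookmarks: when the script sets a `transformation_ctx` and bookmarks are enabled, Glue skips files it believes it has already processed, which can produce exactly this pattern of a "Succeeded" run with no output. One thing worth checking is whether bookmarks are enabled for the job; they can also be reset so the next run re-reads everything. A sketch using boto3, where the job name is a placeholder:

```python
def reset_bookmark(job_name="my-consolidation-job"):
    """Reset the Glue job bookmark so the next run re-reads all input files.

    The default job name is a placeholder for your actual Glue job name.
    """
    import boto3  # imported lazily; only needed when actually calling AWS
    boto3.client("glue").reset_job_bookmark(JobName=job_name)
```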
Glue Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0")
## @type: DataSink
## @args: [connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0"]
## @return: DataSink0
## @inputs: [frame = DataSource0]
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = DataSource0, connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()
Other Posts I Researched First
None has the same problem of a "Succeeded" job producing no output. However, one had empty files being created, while another produced too many files. The most interesting approach was using Athena to create the new output file for you (with an external table); however, when I looked into that, it appeared that the output format options did not include JSON-gzip (or JSON without gzip), only CSV and Parquet, which are non-preferred for my use.
How to Convert Many CSV files to Parquet using AWS Glue
AWS Glue: ETL job creates many empty output files
AWS Glue Job - Writing into single Parquet file
AWS Glue, output one file with partitions
Repartitioning the source DynamicFrame into a single partition before writing consolidates the output into one file:
datasource_df = DataSource0.repartition(1)
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = datasource_df, connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()
I'm relatively new to data lakes and I'm doing some research for a project on AWS.
I have created a DataLake and have tables generated from Glue Crawlers, I can see the data in S3 and query it using Athena. So far so good.
There is a requirement to transform parts of the data stored in the datalake to RDS for applications to read the data. What is the best solution for ETL from S3 DataLake to RDS?
Most posts I've come across talk about ETL from RDS to S3 and not the other way around.
By creating a Glue job using the Spark job type, I was able to use my S3 table as a data source and an Aurora/MariaDB as the destination.
Trying the same with a python job type didn't allow me to view any S3 tables during the Glue Job Wizard screens.
Once the data is in a Glue DynamicFrame or Spark DataFrame, writing it out is pretty much straightforward: use the RDBMS as the data sink.
For example, to write to a Redshift DB,
// Write data to staging table in Redshift
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "preactions" -> "<another SQL queries>",
    "postactions" -> "<some SQL queries>"
  )),
  redshiftTmpDir = tempDir,
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
As shown above, use the JDBC Connection you've created to write the data to.
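Since the question asked about Aurora/MariaDB rather than Redshift, here is roughly the same sink in PySpark using `from_jdbc_conf`. This is a sketch; the connection name, database, and table are placeholders for your own Glue JDBC connection and target:

```python
def write_to_rds(glue_context, frame, tmp_dir):
    """Write a DynamicFrame through a catalogued Glue JDBC connection.

    'my-rds-connection', 'mydb', and 'persons' are placeholder names.
    """
    return glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=frame,
        catalog_connection="my-rds-connection",
        connection_options={"dbtable": "persons", "database": "mydb"},
        redshift_tmp_dir=tmp_dir,  # only actually used for Redshift targets
        transformation_ctx="rds_sink",
    )
```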
You can accomplish that with a Glue Job. Sample code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
file_paths = ['path']
df = glueContext.create_dynamic_frame_from_options("s3", {'paths': file_paths}, format="csv", format_options={"separator": ",", "quoteChar": '"', "withHeader": True})
df.printSchema()
df.show(10)
options = {
    'user': 'usr',
    'password': 'pwd',
    'url': 'url',
    'dbtable': 'tabl'
}
glueContext.write_from_options(frame_or_dfc=df, connection_type="mysql", connection_options=options)
We receive one CSV file from our vendor every day at 11am in an S3 bucket.
I convert this file to Parquet format using Glue at 11:30am.
I've enabled the job bookmark so that already-processed files are not processed again.
Nonetheless, I see that some files are being reprocessed, thus creating duplicates.
I read these questions and answers AWS Glue Bookmark produces duplicates for PARQUET and AWS Glue Job Bookmarking explanation
They gave me a good understanding of job bookmarking, but they still don't address the issue.
According to the AWS documentation, bookmarking supports CSV files.
I'm wondering if someone can help me understand what the problem could be, and a possible solution as well :)
Edit:
Pasting sample code here as requested by Prabhakar.
staging_database_name = "my-glue-db"
s3_target_path = "s3://mybucket/mydata/"
"""
'date_index': date location in the file name
'date_only': only date column is inserted
'date_format': format of date
'path': sub folder name in master bucket
"""
#fouo classified files
tables_spec = {
'sample_table': {'path': 'sample_table/load_date=','pkey': 'mykey', 'orderkey':'myorderkey'}
}
spark_conf = SparkConf().setAll([
    ("spark.hadoop.fs.s3.enableServerSideEncryption", "true"),
    ("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kms_key_id)
])
sc = SparkContext(conf=spark_conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

for table_name, spec in tables_spec.items():
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database=database_name,
                                                                table_name=table_name,
                                                                transformation_ctx='datasource0')
    resolvechoice2 = ResolveChoice.apply(frame=datasource0, choice="make_struct", transformation_ctx='resolvechoice2')

    # Create spark data frame with input_file_name column
    delta_df = resolvechoice2.toDF().withColumn('ingest_datetime', lit(str(ingest_datetime)))
    date_dyf = DynamicFrame.fromDF(delta_df, glueContext, "date_dyf")

    master_folder_path1 = os.path.join(s3_target_path, spec['path']).replace('\\', '/')
    master_folder_path = master_folder_path1 + load_date

    datasink4 = glueContext.write_dynamic_frame.from_options(frame=date_dyf,
                                                             connection_type='s3',
                                                             connection_options={"path": master_folder_path},
                                                             format='parquet', transformation_ctx='datasink4')
job.commit()
I spoke to an AWS Support engineer and she mentioned that she was able to reproduce the issue and has raised it with the Glue technical team for resolution.
Nonetheless, I couldn't wait for them to fix the bug and took a different approach.
Solution:
Disable the Glue bookmark.
After the Glue job converts the CSV file to Parquet, move the CSV file to a different location in the S3 bucket.
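The move step can be scripted as well; a sketch with boto3, where the bucket, the archive prefix, and the key layout are all assumptions:

```python
def archive_key(key, archive_prefix="processed/"):
    """Map an incoming object key to its destination under the archive prefix."""
    return archive_prefix + key.rsplit("/", 1)[-1]

def move_csv(bucket, key, archive_prefix="processed/"):
    """Copy an S3 object under the archive prefix, then delete the original."""
    import boto3  # imported lazily; only needed when actually calling AWS
    s3 = boto3.client("s3")
    dest = archive_key(key, archive_prefix)
    s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=dest)
    s3.delete_object(Bucket=bucket, Key=key)
    return dest
```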
If you follow the AWS Glue Add Job wizard to create a script that writes Parquet files to S3, you end up with generated code something like this.
datasink4 = glueContext.write_dynamic_frame.from_options(
frame=dropnullfields3,
connection_type="s3",
connection_options={"path": "s3://my-s3-bucket/datafile.parquet"},
format="parquet",
transformation_ctx="datasink4",
)
Is it possible to specify a KMS key so that the data is encrypted in the bucket?
For a Glue Scala job:
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
spark.hadoopConfiguration.set("fs.s3.enableServerSideEncryption", "true")
spark.hadoopConfiguration.set("fs.s3.serverSideEncryption.kms.keyId", args("ENCRYPTION_KEY"))
I think the syntax should differ for Python, but the idea is the same.
To spell out the answer using PySpark, you can do either
from pyspark.conf import SparkConf
[...]
spark_conf = SparkConf().setAll([
("spark.hadoop.fs.s3.enableServerSideEncryption", "true"),
("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", "<Your Key ID>")
])
sc = SparkContext(conf=spark_conf)
note the spark.hadoop prefix - or (uglier but shorter)
sc._jsc.hadoopConfiguration().set("fs.s3.enableServerSideEncryption", "true")
sc._jsc.hadoopConfiguration().set("fs.s3.serverSideEncryption.kms.keyId", "<Your Key ID>")
where sc is your current SparkContext.
This isn't necessary. Perhaps it was when the question was first posed, but the same can now be achieved by creating a security configuration and associating it with the Glue job. Just remember to include this in your script, otherwise the encryption won't be applied:
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
https://docs.aws.amazon.com/glue/latest/dg/encryption-security-configuration.html
https://docs.aws.amazon.com/glue/latest/dg/set-up-encryption.html