AWS Glue job bookmark produces duplicates for csv files

We receive one CSV file every day in an S3 bucket from our vendor at 11am.
I convert this file to Parquet format using Glue at 11:30am.
I've enabled the job bookmark so that already-processed files are not processed again.
Nonetheless, I see that some files are being reprocessed, which creates duplicates.
I read these questions and answers: "AWS Glue Bookmark produces duplicates for PARQUET" and "AWS Glue Job Bookmarking explanation".
They gave a good understanding of job bookmarking, but they do not address this issue.
The AWS documentation says that bookmarking supports CSV files.
I'm wondering if someone can help me understand what the problem could be and, if possible, suggest a solution as well :)
Edit:
Pasting sample code here as requested by Prabhakar.
import os
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.functions import lit
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.transforms import ResolveChoice

# args, kms_key_id, database_name, ingest_datetime and load_date are defined elsewhere in the job
staging_database_name = "my-glue-db"
s3_target_path = "s3://mybucket/mydata/"
"""
'date_index': date location in the file name
'date_only': only date column is inserted
'date_format': format of date
'path': sub folder name in master bucket
"""
#fouo classified files
tables_spec = {
    'sample_table': {'path': 'sample_table/load_date=', 'pkey': 'mykey', 'orderkey': 'myorderkey'}
}
spark_conf = SparkConf().setAll([
    ("spark.hadoop.fs.s3.enableServerSideEncryption", "true"),
    ("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kms_key_id)
])
sc = SparkContext(conf=spark_conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
for table_name, spec in tables_spec.items():
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database=database_name,
                                                                 table_name=table_name,
                                                                 transformation_ctx='datasource0')
    resolvechoice2 = ResolveChoice.apply(frame=datasource0, choice="make_struct", transformation_ctx='resolvechoice2')
    # Create Spark data frame with an ingest_datetime column
    delta_df = resolvechoice2.toDF().withColumn('ingest_datetime', lit(str(ingest_datetime)))
    date_dyf = DynamicFrame.fromDF(delta_df, glueContext, "date_dyf")
    master_folder_path1 = os.path.join(s3_target_path, spec['path']).replace('\\', '/')
    master_folder_path = master_folder_path1 + load_date
    datasink4 = glueContext.write_dynamic_frame.from_options(frame=date_dyf,
                                                             connection_type='s3',
                                                             connection_options={"path": master_folder_path},
                                                             format='parquet', transformation_ctx='datasink4')
job.commit()

I spoke to an AWS Support engineer, and she mentioned that she was able to reproduce the issue and has raised it with the Glue technical team for resolution.
Nonetheless, I couldn't wait for them to fix the bug, so I took a different approach.
Solution:
Disable the Glue bookmark.
After the Glue job converts the CSV file to Parquet, move the CSV file to a different location in the S3 bucket.
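The "move the CSV after conversion" step could be sketched roughly as below. This is only an illustration, not the poster's actual code: the helper names `archive_key` and `move_object` and the `processed/` prefix are made up, and the S3 client is passed in (for example `boto3.client("s3")`) so the logic can be tried without AWS access. S3 has no rename operation, so the move is a copy followed by a delete.

```python
def archive_key(key, archive_prefix="processed/"):
    """Return the key a processed object should be moved to (hypothetical layout)."""
    filename = key.rsplit("/", 1)[-1]
    return archive_prefix + filename


def move_object(s3_client, bucket, key, archive_prefix="processed/"):
    """Copy the object under archive_prefix, then delete the original."""
    new_key = archive_key(key, archive_prefix)
    # copy_object and delete_object are standard boto3 S3 client calls
    s3_client.copy_object(
        Bucket=bucket,
        Key=new_key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    s3_client.delete_object(Bucket=bucket, Key=key)
    return new_key
```

In a Glue job this would run after `job.commit()`, e.g. `move_object(boto3.client("s3"), "mybucket", "incoming/file.csv")`, where the bucket and key are placeholders.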

Related

Consolidating Many Data Files into One Using Glue - Job Succeeds But Without Output Files

TL;DR
I'm trying to consolidate many S3 data-files into a fewer number using a Glue [Studio] job
Input data is Catalogued in Glue and queryable via Athena
Glue Job runs with "Succeeded" output status, but no output files are created
Details
Input: I have data that's being created by a scraper on a once-per-minute cycle. It dumps the output in JSON (gzip) format to a bucket. I have this bucket catalogued in Glue and can query against it, with no errors, using Athena. This makes me more confident that I have the Catalogue and data structure set up correctly. On its own, this isn't ideal, as it creates ~1.4K files per day, which makes queries against the data (via Athena) quite slow because they have to scan far too many, far too small files.
Goal: I'd like to periodically (probably once per week or month, I'm not sure yet) consolidate the once-per-minute files into far fewer, so that queries scan bigger and less numerous files (faster queries).
Approach: My plan is to create a Glue ETL job (using Glue Studio) to read from the Catalogue Table and write to a new S3 location (maintaining the same JSON-gzip format, so I can just re-point the Glue table to the new S3 location with the consolidated files). I set up the job using Glue Studio, and when I run it, it says it succeeded, but there's no output at the S3 location specified (not empty files, just nothing at all).
Stuck! I'm at a bit of a loss, since (1) it says it's succeeding, and (2) I'm not even modifying the script (see below), so I'd presume (maybe a bad idea) that it's not that.
Logs: I've tried going through the CloudWatch logs to see if they'd help, but I don't get much out of them. I suspect it may have something to do with this entry, but I can't find a way to either confirm that or change anything to "fix" it. (The path definitely exists, verified by the fact that I can see it in S3, the Catalogue can search it as verified by Athena queries, and it's auto-generated by the Glue Studio script-builder.) To me it sounds like I've selected, somewhere, an option that makes it think I only want some sort of "incremental" scan of the data. But I haven't (knowingly), nor can I find anywhere that would make it seem I have.
CloudWatch Log Entry
21/03/13 17:59:39 WARN HadoopDataSource: Skipping Partition {} as no new files detected @ s3://my_bucket/my_folder/my_source_data/ or path does not exist
Glue Script
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table", transformation_ctx = "DataSource0")
## @type: DataSink
## @args: [connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0"]
## @return: DataSink0
## @inputs: [frame = DataSource0]
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = DataSource0, connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()
Other Posts I Researched First
None have the same problem of a "Succeeded" job providing no output. However, one had empty files being created, while another too many files. The most interesting approach was using Athena to create the new output file for you (with an external table); however, when I looked into that, it appeared that the output format options would not have JSON-gzip (or JSON without gzip), but only CSV and Parquet, which are non-preferred for my use.
How to Convert Many CSV files to Parquet using AWS Glue
AWS Glue: ETL job creates many empty output files
AWS Glue Job - Writing into single Parquet file
AWS Glue, output one file with partitions
# Repartition into a single partition so the sink writes one output file
datasource_df = DataSource0.repartition(1)
DataSink0 = glueContext.write_dynamic_frame.from_options(frame = datasource_df, connection_type = "s3", format = "json", connection_options = {"path": "s3://my_bucket/my_folder/consolidation/", "compression": "gzip", "partitionKeys": []}, transformation_ctx = "DataSink0")
job.commit()
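The CloudWatch warning quoted in the question ("no new files detected") looks like the kind of message Glue logs when job bookmarks are enabled and the source files were already processed by an earlier run. Assuming that is what is happening here (the post does not confirm it), one thing to try is starting a run with bookmarks turned off through the `--job-bookmark-option` job argument. The sketch below is hypothetical: the helper name `start_run_without_bookmarks` is made up, and the Glue client is passed in (for example `boto3.client("glue")`).

```python
def start_run_without_bookmarks(glue_client, job_name):
    """Start a Glue job run with job bookmarks disabled for that run."""
    response = glue_client.start_job_run(
        JobName=job_name,
        # Special Glue job parameter; other accepted values are
        # job-bookmark-enable and job-bookmark-pause.
        Arguments={"--job-bookmark-option": "job-bookmark-disable"},
    )
    return response["JobRunId"]
```

The same setting is also exposed as the "Job bookmark" option in the job's details in the console, so no code change may be needed at all.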

AWS Glue job create_dynamic_frame_from_options() opening a specific file?

If one uses create_dynamic_frame_from_catalog(), you supply the database name and table name, e.g. created from a Glue crawler, which effectively names a specific input file. I want to be able to do the same (name a specific input file) without the crawler and database.
I've tried using create_dynamic_frame_from_options(), but the "path" connection option doesn't allow me to name the file, apparently. Is there any way to do this?
IIUC, you want to read multiple files from a specific S3 path and want the file name in your dataframe. You can achieve this by using the Spark session and reading the path as a PySpark dataframe:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import input_file_name
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
path = 's3://bucket/folder'
df = spark.read.csv(path)
# Add a column holding the S3 path of the file each row came from
df = df.withColumn('FileName', input_file_name())
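If the goal is to open exactly one file without a crawler, one approach that may work is to pass the full object URI in the `paths` list of the S3 connection options (or simply give `spark.read.csv` the full file path instead of the folder). The sketch below only builds the keyword arguments; the helper name `single_file_source_args` and the example URI are hypothetical, and the `withHeader` format option is an assumption that applies to CSV input only.

```python
def single_file_source_args(s3_uri, fmt="csv", with_header=True):
    """Build keyword arguments for create_dynamic_frame.from_options for one S3 object."""
    kwargs = {
        "connection_type": "s3",
        # "paths" takes a list of S3 locations; here it holds a single object URI
        "connection_options": {"paths": [s3_uri]},
        "format": fmt,
    }
    if fmt == "csv":
        # CSV-specific option: treat the first line as a header row
        kwargs["format_options"] = {"withHeader": with_header}
    return kwargs
```

Inside a Glue job this would be used as `glueContext.create_dynamic_frame.from_options(**single_file_source_args("s3://bucket/folder/file.csv"))`.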

Change output CSV file name of AWS Athena queries

I want to run my Athena query through AWS Lambda, but also change the name of my output CSV file from the Query Execution ID to my-bucket/folder/my-preferred-string.csv
I tried searching the web for a solution, but couldn't find the exact code for the Lambda function.
I am a data scientist and a beginner with AWS. This is a one-time thing for me, so I'm looking for a quick solution or a patch.
This question is already posted here
import time
import boto3

client = boto3.client('athena')
s3 = boto3.resource("s3")
# Run query
queryStart = client.start_query_execution(
    # PUT_YOUR_QUERY_HERE
    QueryString = '''
        SELECT *
        FROM "db_name"."table_name"
        WHERE value > 50
    ''',
    QueryExecutionContext = {
        # YOUR_ATHENA_DATABASE_NAME
        'Database': "covid_data"
    },
    ResultConfiguration = {
        # query result output location you mentioned in AWS Athena
        "OutputLocation": "s3://bucket-name-X/folder-Y/"
    }
)
queryId = queryStart['QueryExecutionId']
# Poll until the query finishes rather than assuming 3 seconds is always enough
while True:
    state = client.get_query_execution(QueryExecutionId=queryId)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(3)
if state != 'SUCCEEDED':
    raise RuntimeError('Athena query ended in state ' + state)
# Copies newly generated csv file with appropriate name
# query result output location you mentioned in AWS Athena
queryLoc = "bucket-name-X/folder-Y/" + queryId + ".csv"
# Destination location and file name
s3.Object("bucket-name-A", "report-2018.csv").copy_from(CopySource = queryLoc)
# Deletes the Athena-generated csv and its metadata file from the query output location
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv").delete()
s3.Object("bucket-name-X", "folder-Y/" + queryId + ".csv.metadata").delete()
print('report-2018.csv generated')

How does AWS Glue ETL job retrieve data?

I'm new to using AWS Glue and I don't understand how the ETL job gathers the data. I used a crawler to generate my table schema from some files in an S3 bucket and examined the autogenerated script in the ETL job, which is here (slightly modified):
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("data", "string", "data", "string")], transformation_ctx = "applymapping1")
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
When I run this job, it successfully takes my data from the bucket that my crawler used to generate the table schema and it puts the data into my destination s3 bucket as expected.
My question is this: I don't see anywhere in this script where the data is "loaded", so to speak. I know I point it at the table that was generated by the crawler, but from this doc:
Tables and databases in AWS Glue are objects in the AWS Glue Data Catalog. They contain metadata; they don't contain data from a data store.
If the table only contains metadata, how are the files from the data store (in my case, an S3 bucket) retrieved by the ETL job? I'm asking primarily because I'd like to somehow modify the ETL job to transform identically structured files in a different bucket without having to write a new crawler, but also because I'd like to strengthen my general understanding of the Glue service.
The main thing to understand is:
The Glue Data Catalog (databases and tables) is always in sync with Athena, which is a serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. You can create tables and databases from either the Glue console or the Athena query console.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mytablename", transformation_ctx = "datasource0")
The line of Glue Spark code above does the magic of creating the initial dataframe from the Glue Data Catalog source table. Apart from the metadata, schema, and table properties, the table also has a Location pointing to your data store (the S3 location) where your data resides.
After ApplyMapping has been applied, this portion of the code (the datasink) does the actual loading of data into your target, which in this script is the S3 path in connection_options.
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://myoutputbucket"}, format = "json", transformation_ctx = "datasink2")
If you drill down into the AWS Glue Data Catalog, you will see that tables reside under databases. Clicking on a table shows its metadata, including the S3 folder that the table currently points to as a result of the crawler run.
You can still create a table over structured files in S3 manually by adding a table via the Data Catalog option and pointing it to your S3 location.
Another way is to use the Athena console to create tables pointing to S3 locations. You would use a regular CREATE TABLE script with the LOCATION field holding your S3 location.
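The catalog table's S3 location mentioned above can also be inspected programmatically. Here is a minimal sketch, assuming a Glue client created with `boto3.client("glue")` and the `mydatabase` / `mytablename` names from the question; the helper name `table_location` is made up for illustration.

```python
def table_location(glue_client, database, table):
    """Return the S3 location stored in a Glue Data Catalog table's metadata."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    # The table holds only metadata; the data itself stays at this location
    return response["Table"]["StorageDescriptor"]["Location"]
```

Calling `table_location(boto3.client("glue"), "mydatabase", "mytablename")` should return the path the crawler recorded, which is what `from_catalog` reads from when the job runs.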

Can a SSE:KMS Key ID be specified when writing to S3 in an AWS Glue Job?

If you follow the AWS Glue Add Job Wizard to create a script to write parquet files to S3 you end up with generated code something like this.
datasink4 = glueContext.write_dynamic_frame.from_options(
frame=dropnullfields3,
connection_type="s3",
connection_options={"path": "s3://my-s3-bucket/datafile.parquet"},
format="parquet",
transformation_ctx="datasink4",
)
Is it possible to specify a KMS key so that the data is encrypted in the bucket?
Glue Scala job:
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
spark.hadoopConfiguration.set("fs.s3.enableServerSideEncryption", "true")
spark.hadoopConfiguration.set("fs.s3.serverSideEncryption.kms.keyId", args("ENCRYPTION_KEY"))
I think the syntax will differ for Python, but the idea is the same.
To spell out the answer using PySpark, you can do either
from pyspark.conf import SparkConf
[...]
spark_conf = SparkConf().setAll([
("spark.hadoop.fs.s3.enableServerSideEncryption", "true"),
("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", "<Your Key ID>")
])
sc = SparkContext(conf=spark_conf)
(note the spark.hadoop prefix) or, uglier but shorter,
sc._jsc.hadoopConfiguration().set("fs.s3.enableServerSideEncryption", "true")
sc._jsc.hadoopConfiguration().set("fs.s3.serverSideEncryption.kms.keyId", "<Your Key ID>")
where sc is your current SparkContext.
This isn't necessary. Perhaps it was when the question was first posed, but the same can be achieved by creating a security configuration and associating it with the Glue job. Just remember to have this in your script, otherwise the encryption won't be applied:
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
https://docs.aws.amazon.com/glue/latest/dg/encryption-security-configuration.html
https://docs.aws.amazon.com/glue/latest/dg/set-up-encryption.html
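A security configuration like the one described above can be created in the console or through the API. The following is only a sketch under assumptions: the helper names and the configuration name are hypothetical, the Glue client is passed in (for example `boto3.client("glue")`), and only S3 encryption is configured (CloudWatch and job bookmark encryption are left out).

```python
def s3_kms_encryption_configuration(kms_key_arn):
    """Build an EncryptionConfiguration that encrypts S3 output with SSE-KMS."""
    return {
        "S3Encryption": [
            {"S3EncryptionMode": "SSE-KMS", "KmsKeyArn": kms_key_arn}
        ]
    }


def create_s3_kms_security_configuration(glue_client, name, kms_key_arn):
    """Create a Glue security configuration with the given name and KMS key."""
    return glue_client.create_security_configuration(
        Name=name,
        EncryptionConfiguration=s3_kms_encryption_configuration(kms_key_arn),
    )
```

The configuration still has to be attached to the job (the "Security configuration" setting in the job properties), and the script needs the `job.init(...)` call shown above.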