AWS Glue Bookmark produces duplicates - amazon-web-services

I am submitting a Python script (PySpark, actually) to a Glue job to process Parquet files and extract some analytics from this data source.
These Parquet files live in an S3 folder and continuously grow with new data. I was happy with the bookmarking logic provided by AWS Glue because it helps a lot: it basically allows us to process only new data without reprocessing data that has already been processed.
Unfortunately, in this scenario I notice that duplicates are produced on each run, and it looks like AWS Glue bookmarking is not working at all. What is the reason for this unexpected behaviour?

Can you please check now? Bookmarking supports Parquet and ORC, but only in Glue version 1.0 and later; version 0.9 did not support them.
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

From https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html:
"The Apache Parquet and ORC formats are currently not supported."
UPDATE
Since Jul 26, 2019, AWS Glue also supports the Parquet and ORC formats for bookmarking:
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

Related

Protecting parquet files when appending

I have a problem when I try to do ETL on a large bunch of files on AWS.
The goal is to convert JSON files to Parquet files. Due to the size of the files, I have to do it batch by batch. Let's say I need to do it in 15 batches, i.e. 15 separate runs, to be able to convert all of them.
I am using write.mode("append").format("parquet") in each Glue PySpark job to write into the Parquet files.
My problem is that if one job fails for some reason, I don't know what to do: some partitions are updated while some are not, and some files in the batch have been processed while some have not. For example, if my 9th job fails, I am kind of stuck. I don't want to delete all the Parquet files and start over, but I also don't want to just re-run that 9th job and cause duplicates.
Is there a way to protect the Parquet files so that new files are only appended to them if the whole job is successful?
Thank you!!
Based on your comment and a similar experience I had with this problem, I believe this happens because of S3 eventual consistency. Have a look at the Amazon S3 Data Consistency Model here: https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html.
We found that using the partitioned staging s3a committer with the conflict resolution mode set to replace stopped our jobs from failing.
Try the following parameters with your spark jobs:
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
spark.hadoop.fs.s3a.committer.name partitioned
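If it helps, here is a minimal sketch of how those two settings could be applied from inside a Glue PySpark job itself; the job boilerplate is only illustrative, and whether the S3A staging committers are actually available depends on the Hadoop/Spark build your jobs run on:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Illustrative: configure the S3A partitioned staging committer before any writes.
sc = SparkContext.getOrCreate()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Use the "partitioned" staging committer and resolve partition conflicts by
# replacing existing partition contents rather than appending duplicates.
hadoop_conf.set("fs.s3a.committer.name", "partitioned")
hadoop_conf.set("fs.s3a.committer.staging.conflict-mode", "replace")

glueContext = GlueContext(sc)
spark = glueContext.spark_session

The same effect can be had at submit time by passing the two spark.hadoop.fs.s3a.* properties above as --conf options.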
Also have a read about the committers here:
https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
Hope this helps!
P.S. If all else fails and your files are not too big, you can do a hacky solution where you save your Parquet file locally and upload it when your Spark tasks are complete, but I personally do not recommend it.

Index large DynamoDB table into Cloudsearch via AWS CLI

I'm currently facing an issue on CloudSearch when trying to index a large DynamoDB table via AWS Console:
Retrieving a subset of the table...
The request took too long to complete. Please try again or use the command line tools.
After looking through the documentation [1, 2, 3], there are examples of uploading several file formats through the CLI, but no mention of uploading data from a DynamoDB table using the CLI.
How could this be done without having to download the entire database to a file and then uploading it?
I got in contact with AWS Support and there is no way of indexing DynamoDB data using the CLI.
The only available way is downloading the data and uploading it through the CLI using a standard file format/structure (.csv, .json, .xml or .txt).
A shame, unfortunately.
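For anyone who ends up scripting that workaround, here is a rough sketch using boto3 instead of the raw CLI. The table name and document endpoint are placeholders, it assumes each item has an id attribute to use as the document ID, and your domain's index fields have to match the item attributes:

import json
import boto3

TABLE_NAME = "my-table"  # placeholder
DOC_ENDPOINT = "https://doc-my-domain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"  # placeholder

table = boto3.resource("dynamodb").Table(TABLE_NAME)
cloudsearch = boto3.client("cloudsearchdomain", endpoint_url=DOC_ENDPOINT)

def scan_all_items(table):
    # Paginate through the whole table with Scan.
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        yield from page["Items"]
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

def upload(batch):
    # CloudSearch expects an SDF batch; each upload batch is limited to 5 MB.
    cloudsearch.upload_documents(
        documents=json.dumps(batch, default=str).encode("utf-8"),
        contentType="application/json",
    )

batch = []
for item in scan_all_items(table):
    batch.append({"type": "add", "id": str(item["id"]), "fields": item})
    if len(batch) == 1000:
        upload(batch)
        batch = []
if batch:
    upload(batch)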

Processing unpartitioned data with AWS Glue using bookmarking

I have data being written from Kafka to a directory in s3 with a structure like this:
s3://bucket/topics/topic1/files1...N
s3://bucket/topics/topic2/files1...N
.
.
s3://bucket/topics/topicN/files1...N
There is already a lot of data in this bucket and I want to use AWS Glue to transform it into Parquet and partition it, but there is way too much data to do it all at once. I was looking into bookmarking, and it seems like you can't use it to read only the most recent data or to process the data in chunks. Is there a recommended way of processing data like this so that bookmarking will work when new data comes in?
Also, does bookmarking require that Spark or Glue scan my entire dataset each time I run a job to figure out which files are newer than the last run's max_last_modified timestamp? That seems pretty inefficient, especially as the data in the source bucket continues to grow.
I have learned that Glue wants all similar files (files with the same structure and purpose) to be under one folder, with optional subfolders.
s3://my-bucket/report-type-a/yyyy/mm/dd/file1.txt
s3://my-bucket/report-type-a/yyyy/mm/dd/file2.txt
...
s3://my-bucket/report-type-b/yyyy/mm/dd/file23.txt
All of the files under report-type-a folder must be of the same format. Put a different report like report-type-b in a different folder.
You might try putting just a few of your input files in the proper place, running your ETL job, placing more files in the bucket, running again, etc.
I tried this by getting the current files working (one file per day), then back-filling historical files. Note, however, that this did not work completely. Files were processed fine in s3://my-bucket/report-type/2019/07/report_20190722.gzip, but when I tried to add past files to s3://my-bucket/report-type/2019/05/report_20190510.gzip, Glue did not "see" or process the file in the older folder.
However, if I moved the old report to the current partition, it worked: s3://my-bucket/report-type/2019/07/report_20190510.gzip.
AWS Glue bookmarking works only with a select few formats (more here) and only when data is read using the glueContext.create_dynamic_frame.from_options function. Along with this, job.init() and job.commit() should also be present in the Glue script, as in the skeleton below. You can check out a related answer.
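Here is a minimal skeleton of such a script; the paths and the source format are placeholders, and the pieces that matter for bookmarking are job.init(), a transformation_ctx on the bookmarked reads/writes, and job.commit():

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # loads the bookmark state for this job name

# transformation_ctx is the key the bookmark state is recorded under.
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/topics/"], "recurse": True},
    format="json",  # adjust to whatever format Kafka wrote
    transformation_ctx="source",
)

# ... transform / partition here ...

glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://bucket/output/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()  # without this the bookmark never advances and old data is reprocessed

Bookmarks also have to be switched on for the job itself, e.g. with the --job-bookmark-option job-bookmark-enable job argument.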

How do I import JSON data from S3 using AWS Glue?

I have a whole bunch of data in AWS S3 stored in JSON format. It looks like this:
s3://my-bucket/store-1/20190101/sales.json
s3://my-bucket/store-1/20190102/sales.json
s3://my-bucket/store-1/20190103/sales.json
s3://my-bucket/store-1/20190104/sales.json
...
s3://my-bucket/store-2/20190101/sales.json
s3://my-bucket/store-2/20190102/sales.json
s3://my-bucket/store-2/20190103/sales.json
s3://my-bucket/store-2/20190104/sales.json
...
It's all the same schema. I want to get all that JSON data into a single database table. I can't find a good tutorial that explains how to set this up.
Ideally, I would also be able to perform small "normalization" transformations on some columns, too.
I assume Glue is the right choice, but I am open to other options!
If you need to process data using Glue and there is no need to have a table registered in the Glue Catalog, then there is no need to run a Glue Crawler. You can set up a job and use getSourceWithFormat() with the recurse option set to true and paths pointing to the root folder (in your case ["s3://my-bucket/"] or ["s3://my-bucket/store-1", "s3://my-bucket/store-2", ...]). In the job you can also apply any required transformations and then write the result into another S3 bucket, a relational DB, or the Glue Catalog, as in the sketch below.
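In PySpark the same idea can be sketched with create_dynamic_frame.from_options and the recurse option; the output path here is a placeholder:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read every sales.json under the store folders in one go.
sales = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/store-1", "s3://my-bucket/store-2"],
        "recurse": True,  # descend into the per-date subfolders
    },
    format="json",
)

# ... apply any small "normalization" transformations here ...

# Write the combined result out, e.g. as Parquet (placeholder path).
glueContext.write_dynamic_frame.from_options(
    frame=sales,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/sales-combined/"},
    format="parquet",
)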
Yes, Glue is a great tool for this!
Use a crawler to create a table in the glue data catalog (remember to set Create a single schema for each S3 path under Grouping behavior for S3 data when creating the crawler)
Read more about it here
Then you can use relationalize to flatten out your JSON structure (a sketch follows); read more about that here
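A short sketch of that route, assuming the crawler has already created a table; the database, table, and staging-path names are placeholders:

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the table the crawler registered (placeholder catalog names).
sales = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_bucket"
)

# Relationalize flattens nested JSON into a collection of flat tables;
# it needs an S3 staging path for intermediate data.
flattened = Relationalize.apply(
    frame=sales, staging_path="s3://my-bucket/glue-temp/", name="root"
)

# "root" is the flattened top-level table; nested arrays become root_<field> tables.
root = flattened.select("root")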
JSON and AWS Glue may not be the best match. Since AWS Glue is based on Hadoop, it inherits Hadoop's "one-row-per-newline" restriction, so even if your data is in JSON, it has to be formatted with one JSON object per line [1]. Since you'll be pre-processing your data anyway to get it into this line-separated format, it may be easier to use CSV instead of JSON.
Edit 2022-11-29: There does appear to be some tooling now for JSONL, which is the actual format that AWS expects, making this less of an automatic win for CSV. I would say that if your data is already in JSON format, it's probably smarter to convert it to JSONL than to CSV.
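If your files hold JSON arrays rather than one object per line, that pre-processing step is a small reformat; a tiny sketch with hypothetical file names:

import json

# Convert a file holding one JSON array into JSON Lines (one object per line).
with open("sales.json") as src, open("sales.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")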

AWS Glue - Avro snappy compression read error - HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split

After saving Avro files with Snappy compression (the same error occurs with gzip/bzip2 compression) to S3 using AWS Glue, when I try to read the data in Athena using an AWS crawler, I get the following error: HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split - using org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat: Not a data file. Any idea why I get this error and how to resolve it?
Thank you.
I circumvented this issue by attaching the native Spark Avro JAR file to the Glue job during execution, using the native Spark read/write methods to write the files in Avro format, and, for the compression, setting spark.conf.set("spark.sql.avro.compression.codec","snappy") as soon as the Spark session is created.
This works perfectly for me, and the output can be read via Athena as well.
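A sketch of that workaround; it assumes a spark-avro JAR matching your Spark version has been attached to the job (for example via the --extra-jars job parameter), and the input/output paths are placeholders:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Set the Avro codec right after the session is available, before any writes.
spark.conf.set("spark.sql.avro.compression.codec", "snappy")

df = spark.read.parquet("s3://my-bucket/input/")  # placeholder source

# Use the native Spark writer so compression is applied to blocks inside the
# Avro container rather than to the whole file. With older spark-avro JARs the
# format name is "com.databricks.spark.avro" instead of "avro".
df.write.format("avro").mode("overwrite").save("s3://my-bucket/avro-output/")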
AWS Glue doesn't support writing Avro with compression, even though this isn't stated clearly in the docs. The job succeeds, but it applies the compression in the wrong way: instead of compressing the file blocks, it compresses the entire file, which is incorrect and is the reason why Athena can't query it.
There are plans to fix the issue, but I don't know the ETA.
It would be nice if you could contact AWS Support to let them know that you are having this issue too (the more customers affected, the sooner it gets fixed).