Glue Job fails to write file - amazon-web-services

I am back filling some data via glue jobs. The job itself is reading in a TSV from s3, transforming the data slightly, and writing it in Parquet to S3. Since I already have the data, I am trying to launch multiple jobs at once to reduce the amount of time needed to process it all. When I launch multiple jobs at the same time, I run into an issue sometimes where one of the files will fail to output the resultant Parquet files in S3. The job itself completes successfully without throwing an error When I rerun the job as a non-parallel task, the file it output correctly. Is there some issue, either with glue(or the underlying spark) or S3 that would cause my issue?

The same Glue job running in parallel may produce files with the same names and therefore some of them can be overwritten. As I remember correctly, transformation-context is used as part of the name. I assume you don't have bookmarking enabled so it should be safe for you to generate transformation-context value dynamically to ensure it's unique for each job.

Related

Apache Spark - Write Parquet Files to S3 with both Dynamic Partition Overwrite and S3 Committer

I'm currently building an application with Apache Spark (pyspark), and I have the following use case:
Run pyspark with local mode (using spark-submit local[*]).
Write the results of my spark job to S3 in the form of partitioned Parquet files.
Ensure that each job overwrite the particular partition it is writing to, in order to ensure idempotent jobs.
Ensure that spark-staging files are written to local disk before being committed to S3, as staging in S3, and then committing via a rename operation, is very expensive.
For various internal reasons, all four of the above bullet points are non-negotiable.
I have everything but the last bullet point working. I'm running a pyspark application, and writing to S3 (actually an on-prem Ceph instance), ensuring that spark.sql.sources.partitionOverwriteMode is set to dynamic.
However, this means that my spark-staging files are being staged in S3, and then committed by using a delete-and-rename operation, which is very expensive.
I've tried using the Spark Directory Committer in order to stage files on my local disk. This works great unless spark.sql.sources.partitionOverwriteMode.
After digging through the source code, it looks like the PathOutputCommitter does not support Dynamic Partition Overwriting.
At this point, I'm stuck. I want to be able to write my staging files to local disk, and then commit the results to S3. However, I also need to be able to dynamically overwrite a single partition without overwriting the entire Parquet table.
For reference, I'm running pyspark=3.1.2, and using the following spark-submit command:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
I get the following error when spark.sql.sources.partitionOverwriteMode is set to dynamic:
java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
My spark config is as follows:
self.spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "magic")
self.spark.conf.set("spark.sql.sources.commitProtocolClass",
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class",
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
self.spark.conf.set(
"spark.sql.sources.partitionOverwriteMode", "dynamic"
)
afraid the s3a committers don't support the dynamic partition overwrite feature. That actually works by doing lots of renaming, so misses the entire point of zero rename committers.
the "partioned" committer was written by netflix for their use case of updating/overwriting single partitions in an active table. it should work for you as it is the same use case.
consult the documentation

Dataflow writes to GCS bucket, but timestamp in filename is unchanged

I have a question on Apache Beam, especially on dataflow.
I have a pipeline which reads from a cloudsql database and writes to GCS. The filename has a timestamp in it. I expect that each time I run it, it will generate a file with a different timestamp in it.
I tested on my local machine. Beam reads from a postgres db and writes to a file (instead of gcs). It works fine. The files generated have different timestamps in it. Like
jdbc_output.csv-00000-of-00001_2020-08-19_00:11:17.csv
jdbc_output.csv-00000-of-00001_2020-08-19_00:25:07.csv
However, when I deploy to Dataflow, trigger it via Airflow (we have airflow as scheduler), the filename it generates always uses the same timestamp. The timestamp is unchanged even if I run it multiple times. The timestamp is very close to the time when the dataflow template was uploaded.
Here is the simple code to write.
output.apply("Write to Bucket", TextIO.write().to("gs://my-bucket/filename").withNumShards(1)
.withSuffix("_" + String.valueOf(new Timestamp(new Date().getTime())).replace(" ","_") +".csv"));
I'd like to know the reason why dataflow does not use the current time in the filename, instead it uses the timestamp when the template file was uploaded.
Further, how to solve this issue? My plan is to run the dataflow each day, and expecting a new file with a different timestamp in it.
My intuition (because I never tested it) is that the template creation start your pipeline and take a snapshot of it. Therefore, your pipeline is ran, your date time evaluated and kept as-is in the template. And the value never change.
The documentation description also mentions that the pipeline is run before the template creation, like a compilation indeed.
Developers run the pipeline and create a template. The Apache Beam SDK stages files in Cloud Storage, creates a template file (similar to job request), and saves the template file in Cloud Storage.
To fix this, you can use ValueProvider interface. And, I never made the link before now, but it's in the template section of the documentation
Note: However, the cheapest, and the easiest to maintain for simple read in Cloud SQL database and export to file, is to not use Dataflow!

parquet files protection when appending

I have a problem when I try to do ETL on large bunch of files on AWS.
The goal is to convert JSON files to parquet files. due to the size of the files I have to do it batch by batch . Let's say I need to do it in 15 batches , i.e. 15 separate runs to be able to convert all of them.
I am using write.mode("append").format("parquet") to write into parquet files in each glue pyspark job to do that.
My problem is if one job failed for some reason then I don't know what to do - some partitions are updated while some are not, some files in the batch have been processed while some have not. for example if my 9th job failed, I am kind of stuck. I dont want to delete all parquet files to start over, but also dont want to just re-run that 9th job and cause duplicates.
Is there a way to protect parquet files to only append new files into them if the whole job is successful?
THank you!!
Based on your comment and a similar experience I had with this problem, I believe this happens because of S3 eventual consistency. Have a look at Amazon S3 Data Consistency Model here https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html.
We found that using partitioned staging s3a committer with the conflict resolution mode replace made our jobs not fail.
Try the following parameters with your spark jobs:
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
spark.hadoop.fs.s3a.committer.name partitioned
Also have a read about the committers here:
https://hadoop.apache.org/docs/r3.1.1/hadoop-aws/tools/hadoop-aws/committers.html
Hope this helps!
P.S. If all fails and our files are not too big, you can do a hacky solution where you save your parquet file locally and upload when your spark tasks are complete, but I personally do not recommend.

Compose Google Storage Objects without headers via CLI

I was wondering if it would be possible to compose Google Storage Objects (specifically csv files) without headers (i.e. without the row with column names) while using gsutil.
Currently, I can do the following:
gsutil compose gs://bucket/test_file_1.csv gs://bucket/test_file_2.csv gs://bucket/test-composition-files.csv
However, I will be unable to ingest test-composition-files.csv into Google BigQuery because compose blindly appended the files (including the column names).
One possible solution would be to download the file locally and process it with pandas, but this is not ideal for large files.
Is there any way to do this via the CLI? I could not find anything in the docs.
By reading the comment, I think you are spending effort in the wrong way. I understood that you wanted to load your files into big query, but the large number of file prevented you to do this (too many API calls). And dataflow is too slow.
Maybe you can think differently. I have 2 solutions to propose
If you need "near real time" ingestion, and if file size is bellow 1.5Gb, the best way is to build a function which read the file and perform a stream write to BigQuery. This function is triggered by a Cloud Storage event. If there is several file in the same time, several functions will be spawn. Be careful, stream write to BigQuery is not free
If you can wait up to 2 minutes when a file arrive, I recommend you to build a Cloud Functions, triggered every 2 minutes. This function read the file name in a bucket, move them to a sub directory and perform a load job of all the files in the sub directory. You are limited to 1000 load jobs per day (and per table), a day contains 1440 minutes. Batch every 2 minutes you are OK. The load job are free.
Is it acceptable alternatives?

Processing unpartitioned data with AWS Glue using bookmarking

I have data being written from Kafka to a directory in s3 with a structure like this:
s3://bucket/topics/topic1/files1...N
s3://bucket/topics/topic2/files1...N
.
.
s3://bucket/topics/topicN/files1...N
There is already a lot of data in this bucket and I want to use AWS Glue to transform it into parquet and partition it, but there is way too much data to do it all at once. I was looking into bookmarking and it seems like you can't use it to only read the most recent data or to process data in chunks. Is there a recommended way of processing data like this so that bookmarking will work for when new data comes in?
Also, does bookmarking require that spark or glue has to scan my entire dataset each time I run a job to figure out which files are greater than the last runs max_last_modified timestamp? That seems pretty inefficient especially as the data in the source bucket continues to grow.
I have learned that Glue wants all similar files (files with same structure and purpose) to be under one folder, with optional subfolders.
s3://my-bucket/report-type-a/yyyy/mm/dd/file1.txt
s3://my-bucket/report-type-a/yyyy/mm/dd/file2.txt
...
s3://my-bucket/report-type-b/yyyy/mm/dd/file23.txt
All of the files under report-type-a folder must be of the same format. Put a different report like report-type-b in a different folder.
You might try putting just a few of your input files in the proper place, running your ETL job, placing more files in the bucket, running again, etc.
I tried this by getting the current files working (one file per day), then back-filling historical files. Note however, that this did not work completely. I have been getting files processed ok in s3://my-bucket/report-type/2019/07/report_20190722.gzp, but when I tried to add past files to 's3://my-bucket/report-type/2019/05/report_20190510.gzip`, Glue did not "see" or process the file in the older folder.
However, if I moved the old report to the current partition, it worked: s3://my-bucket/report-type/2019/07/report_20190510.gzip .
AWS Glue bookmarking works only with a select few formats (more here) and when read using glueContext.create_dynamic_frame.from_options function. Along with this job.init() and job.commit() should also be present in the glue script. You can checkout a related answer.