Spark doesn't output .crc files on S3 - amazon-web-services

When I use Spark locally, writing data to my local filesystem, it creates some useful .crc files.
Running the same job on AWS EMR and writing to S3, the .crc files are not written.
Is this normal? Is there a way to force the writing of .crc files on S3?

Those .crc files are created by the low-level bits of the Hadoop FS binding so that it can identify when a block is corrupt and, on HDFS, switch to another datanode's copy of the data for the read and kick off re-replication from one of the good copies.
On S3, stopping corruption is left to AWS.
What you can get from S3 is the ETag of an object, which is the MD5 checksum for a small upload; for a multipart upload it is some other string, which, again, changes each time you upload it.
You can get at this value through the Hadoop 3.1+ version of the S3A connector, though it's off by default because distcp gets very confused when uploading from HDFS. For earlier versions you can't get at it, nor does the aws s3 command show it; you'd have to try some other S3 library (it's just a HEAD request, after all).
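If you just want to inspect the ETag yourself, here is a minimal sketch using boto3; the bucket and key names are placeholders.
import boto3

# HEAD request for the object's metadata; nothing is downloaded
s3 = boto3.client("s3")
head = s3.head_object(Bucket="my-bucket", Key="path/to/part-00000.parquet")

# For a single-part upload this is the MD5 checksum (quoted);
# for a multipart upload it is an opaque value ending in "-<part count>"
print(head["ETag"])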

Related

Apache Spark - Write Parquet Files to S3 with both Dynamic Partition Overwrite and S3 Committer

I'm currently building an application with Apache Spark (pyspark), and I have the following use case:
Run pyspark with local mode (using spark-submit local[*]).
Write the results of my spark job to S3 in the form of partitioned Parquet files.
Ensure that each job overwrites the particular partition it is writing to, in order to ensure idempotent jobs.
Ensure that spark-staging files are written to local disk before being committed to S3, as staging in S3, and then committing via a rename operation, is very expensive.
For various internal reasons, all four of the above bullet points are non-negotiable.
I have everything but the last bullet point working. I'm running a pyspark application, and writing to S3 (actually an on-prem Ceph instance), ensuring that spark.sql.sources.partitionOverwriteMode is set to dynamic.
However, this means that my spark-staging files are being staged in S3, and then committed by using a delete-and-rename operation, which is very expensive.
I've tried using the Spark Directory Committer in order to stage files on my local disk. This works great unless spark.sql.sources.partitionOverwriteMode is set to dynamic.
After digging through the source code, it looks like the PathOutputCommitter does not support Dynamic Partition Overwriting.
At this point, I'm stuck. I want to be able to write my staging files to local disk, and then commit the results to S3. However, I also need to be able to dynamically overwrite a single partition without overwriting the entire Parquet table.
For reference, I'm running pyspark=3.1.2, and using the following spark-submit command:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
I get the following error when spark.sql.sources.partitionOverwriteMode is set to dynamic:
java.io.IOException: PathOutputCommitProtocol does not support dynamicPartitionOverwrite
My spark config is as follows:
self.spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "magic")
self.spark.conf.set("spark.sql.sources.commitProtocolClass",
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class",
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
self.spark.conf.set(
"spark.sql.sources.partitionOverwriteMode", "dynamic"
)
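For context, a write along the following lines is what hits the exception above under this configuration; the DataFrame, the partition column, and the destination path are illustrative placeholders.
# df, the "date" partition column, and the bucket path are hypothetical
df.write \
    .mode("overwrite") \
    .partitionBy("date") \
    .parquet("s3a://my-bucket/my-table/")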
Afraid the S3A committers don't support the dynamic partition overwrite feature. That feature actually works by doing lots of renaming, so it misses the entire point of zero-rename committers.
the "partioned" committer was written by netflix for their use case of updating/overwriting single partitions in an active table. it should work for you as it is the same use case.
Consult the documentation.
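For illustration, a rough configuration sketch for switching from the magic committer to the partitioned staging committer, based on the Hadoop S3A committer documentation; verify the conflict-mode semantics against your exact overwrite requirements.
self.spark.conf.set("spark.sql.sources.commitProtocolClass",
                    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class",
                    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
# Staging committers buffer task output on local disk and upload at commit time
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "partitioned")
# Replace only the partitions that this job actually writes into
self.spark.conf.set("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
# Leave spark.sql.sources.partitionOverwriteMode at its default ("static"),
# since PathOutputCommitProtocol rejects "dynamic" (the exception above)
Because the staging committers buffer output on local disk rather than in S3, this should also cover the local-staging requirement.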

Reading Spark Dataframe from S3 Bucket While Another Process Writes to it?

Would there be any issues with reading a Spark dataframe and, say, persisting it via a Jupyter notebook while another process writes to the S3 bucket concurrently?
Say,
I read a dataframe like:
s3 = spark.read.parquet('s3://path/to/table')
And work on this in a notebook.
Concurrently I write out to the same s3 bucket at some point via a different process, e.g.
system('s3-dist-cp --src --dest s3://path/to/table')
Would this ever prove to be an issue? I am ok with messing up the read / dataframe but I would not want to block writing out to the bucket.
This will cause a FileNotFoundException on any action on the first DataFrame that you read.
s3 = spark.read.parquet('s3://path/to/table')
The first Spark job involved in the above is listing leaf files and directories. As another process is writing/rewriting data, those listed paths become stale.
Furthermore, the eventual-consistency behavior of S3 should also be considered.
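A small sketch of how this surfaces; the path and the action are illustrative.
# The path is illustrative. The listing of leaf files happens at read time:
df = spark.read.parquet("s3://path/to/table")

# ... another process rewrites s3://path/to/table in the meantime ...

# Any action that scans the data uses the stale listing and can fail with
# FileNotFoundException for objects that no longer exist
df.count()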

Does the Amazon S3 sync command upload the entire modified file again or just the delta in the file?

My system generates large log files continuously and I want to upload all of them to Amazon S3. I am planning to use the s3 sync command for this. My system appends logs to the same file until it reaches about 50 MB and then creates a new log file. I understand that the sync command will sync the modified local log file to the S3 bucket, but I don't want to upload the entire log file each time it changes, as the files are large and sending the same data again and again will consume my bandwidth.
So I am wondering: does the s3 sync command send the entire modified file, or just the delta?
The documentation implies that it copies whole updated files:
Recursively copies new and updated files
Plus, there would be no way to do this without first downloading the file from S3, which would effectively double the cost of an upload since you'd pay for both the download and the upload.

AWS S3 sync between buckets overwriting newer destination files

We have two s3 buckets, and we have a sync cron job that should copy bucket1 changes to bucket2.
aws s3 sync s3://bucket1/images/ s3://bucket2/images/
When a new image is added to bucket1, it correctly gets copied over to bucket2.
However, if we upload a new version of that image to bucket2, when the sync job next runs it actually copies the older version from bucket1 over to bucket2, replacing the newer version we just put there.
This is part of a migration process, and in time the only place images will be uploaded to will be bucket2, but for the time being they may sometimes be uploaded to either, and we only want changes from bucket1 to be copied up to bucket2, NOT the other way round.
Why does the aws sync job seem to think that the file on bucket1 has changed? Does it not know that the file in bucket2 is newer, so it should be left alone?
The AWS Command-Line Interface (CLI) aws s3 sync command copies content from the Source location to the Destination location. It only copies files that are new or that differ from what is in the Destination.
It is designed as a one-way sync, not a two-way sync. Your file is being overwritten because the file in the Source differs from the file in the Destination. This is the intended behavior.
There is limited scope to tweak this behavior, for example (from the sync command documentation):
--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
However, there does not appear to be an option that prevents overwriting merely because a file with the same name already exists, nor one that prefers to keep the newer file.
If you want a two-way sync with more specific rules, you will need to code it yourself.
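As one illustration of coding it yourself, here is a rough boto3 sketch that copies an object from the source bucket only when the source copy is newer than the one in the destination; bucket names and the prefix are placeholders.
import boto3
from botocore.exceptions import ClientError

# Bucket names and prefix are placeholders
s3 = boto3.client("s3")
SRC, DST, PREFIX = "bucket1", "bucket2", "images/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        try:
            dst = s3.head_object(Bucket=DST, Key=key)
            # Keep the destination copy if it is at least as new as the source
            if dst["LastModified"] >= obj["LastModified"]:
                continue
        except ClientError:
            pass  # not in the destination yet, so copy it
        s3.copy_object(Bucket=DST, Key=key,
                       CopySource={"Bucket": SRC, "Key": key})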

AWS S3 Sync very slow when copying to large directories

When syncing data to an empty directory in S3 using the AWS CLI, it's almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before it even starts uploading/syncing files.
Is there an alternative method? It looks like it's trying to account for all files in an S3 directory before syncing - I don't need that, and uploading the data without checking beforehand would be fine.
The sync command needs to enumerate all of the objects in the bucket to determine whether a local file already exists there and whether it is the same as the local file. The more objects you have in the bucket, the longer this takes.
If you don't need this sync behavior just use a recursive copy command like:
aws s3 cp --recursive . s3://mybucket/
and this should copy all of the local files in the current directory to the bucket in S3.
If you use the unofficial s3cmd from S3 Tools, you can use the --no-check-md5 option while using sync to disable the MD5 sums comparison to significantly speed up the process.
--no-check-md5 Do not check MD5 sums when comparing files for [sync].
Only size will be compared. May significantly speed up
transfer but may also miss some changed files.
Source: https://s3tools.org/usage
Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/