'Premature end of Content-Length' with Spark Application using s3a - amazon-web-services

I'm writing a Spark-based application that works on a fairly huge dataset stored on S3, about 15 TB uncompressed. The data is laid out across many small LZO-compressed files, varying from 10 to 100 MB each.
By default the job spawns 130k tasks while reading the dataset and mapping it to the schema.
It then fails after roughly 70k task completions and ~20 task failures.
Exception:
WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body
Looks like the s3 connection is getting closed prematurely.
I have tried nearly 40 different combinations of configurations.
To summarize them: 1 to 3 executors per node; 18 GB to 42 GB of --executor-memory; 3 to 5 --executor-cores; 1.8 GB to 4.0 GB of spark.yarn.executor.memoryOverhead; both the Kryo and default Java serializers; spark.memory.storageFraction from 0.35 to 0.5; the default, 130000, and 200000 partitions for the bigger dataset; and spark.sql.shuffle.partitions at the default, 200, and 2001.
And most importantly: fs.s3a.connection.maximum from 100 to 2048.
[This seems to be the property most relevant to the exception.]
[In all cases the driver was set to memory = 51 GB, cores = 12, with MEMORY_AND_DISK_SER as the caching level.]
Nothing worked!
If I run the program on half of the bigger dataset (7.5 TB), it finishes successfully in 1.5 hours.
What could I be doing wrong?
How do I determine the optimal value for fs.s3a.connection.maximum?
Is it possible that the s3 clients are getting GCed?
Any help will be appreciated!
Environment:
AWS EMR 5.7.0, 60 x i2.2xlarge SPOT Instances (16 vCPU, 61GB RAM, 2 x 800GB SSD), Spark 2.1.0
YARN is used as resource manager.
Code:
It's a fairly simple job, doing something like this:
val sl = StorageLevel.MEMORY_AND_DISK_SER
sparkSession.sparkContext.hadoopConfiguration.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 1200)
val dataset_1: DataFrame = sparkSession
.read
.format("csv")
.option("delimiter", ",")
.schema(<schema: StructType>)
.csv("s3a://...")
.select("ID") //15 TB
dataset_1.persist(sl)
print(dataset_1.count())
val tmp = dataset_1.groupBy("ID").agg(count("*").alias("count_id"))
val tmp2 = tmp.groupBy("count_id").agg(count("*").alias("count_count_id"))
tmp2.write.csv(...)
dataset_1.unpersist()
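For reference, here is a rough PySpark rendering of the same S3A setup (a sketch only, not the code actually run), together with a few other hadoop-aws resilience properties (fs.s3a.attempts.maximum, fs.s3a.connection.timeout, fs.s3a.connection.establish.timeout) that are often tuned alongside fs.s3a.connection.maximum; all values are illustrative:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("s3a-read-sketch").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

hconf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
hconf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

# The connection pool is per JVM (i.e. per executor), so a common rule of thumb is to keep it
# comfortably above the number of concurrent S3 readers in that JVM (roughly the executor
# core count), with headroom for retries and re-opened streams.
hconf.setInt("fs.s3a.connection.maximum", 200)

# Resilience knobs (illustrative values; timeouts are in milliseconds):
hconf.setInt("fs.s3a.attempts.maximum", 20)                 # retries on transient S3 failures
hconf.setInt("fs.s3a.connection.timeout", 200000)           # socket read timeout
hconf.setInt("fs.s3a.connection.establish.timeout", 5000)   # TCP connect timeout

schema = StructType([StructField("ID", StringType(), True)])  # illustrative single-column schema
df = (spark.read
      .option("delimiter", ",")
      .schema(schema)
      .csv("s3a://bucket/path/")    # placeholder path
      .select("ID"))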
Full Stacktrace:
17/08/21 20:02:36 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
17/08/21 20:06:18 WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 79627927; received: 19388396
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:73)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:321)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:261)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:186)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:99)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:91)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:364)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1021)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
EDIT: We have another service which consumes exactly the same logs, and it works just fine. But it uses the old "s3://" scheme and is based on Spark 1.6. I'll try using "s3://" instead of "s3a://".

Related

boto3 s3 resource, bucket.objects.filter why is this so slow and takes so long

I am trying to get the object list from a bucket that has probably billions of objects in it, across hundreds of folders, each with xx million objects.
The question I have is: why does using the s3 resource's objects.filter take so long? And is there a way to use it and still get the same result as using the client?
Test results below:
s3 = boto3.resource("s3")
s3bucket = s3.Bucket("myBucket")
s3objects = s3bucket.objects.filter(Prefix="folder1/hello09_") # Also tried adding page_size(1000) no change
# folder1/ has xxx million objects
# folder1/hello09_ has xx million objects in it (ex. s3://myBucket/folder1/hello09_file0000001)
# takes about 21 minutes
tmp = [s3obj.key for s3obj in s3objects]
# also takes about 20 minutes
tmp = []
for s3obj in s3objects:
    tmp.append(s3obj.key)
# takes about 2 minutes
# wrote a s3client loop that gets 1000 files at a time
# client = boto3.client('s3')
# client.list_objects_v2(Bucket="myBucket",Prefix="folder1/hello09_")
# loop with NextContinuationToken until no more, 1000 objects each
# takes about 31 minutes
# same s3 client loop but this time changed the prefix to be only up to the delimiter
# The extra 10 minutes might be due to appending the results into an array, key by key?
# client = boto3.client('s3')
# client.list_objects_v2(Bucket="myBucket",Prefix="folder1/")
# loop with NextContinuationToken until no more, 1000 objects each
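A runnable version of the client loop described in those comments might look like this (a sketch, using the same placeholder bucket and prefix):
import boto3

client = boto3.client("s3")
keys = []
kwargs = {"Bucket": "myBucket", "Prefix": "folder1/hello09_"}
while True:
    resp = client.list_objects_v2(**kwargs)
    for obj in resp.get("Contents", []):
        keys.append(obj["Key"])
    if resp.get("IsTruncated"):                      # more pages left
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]
    else:
        break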
My best guess is that bucket.objects.filter queries ALL objects up to the last delimiter (folder1/) and THEN filters on whatever comes after it (hello09_), instead of querying with the whole filter path (folder1/hello09_) the way the client does.
I thought resource might be querying one object at a time when looped over like an array, instead of grabbing 1000 objects per call (can you even do that?). But I had a similar situation with dozens of sub-prefixes, each containing one file, and resource.bucket.objects.filter performed the same as client.list_objects_v2.
Is this a bug in the boto3 filter code, or a feature that can be circumvented so resource can still be used with the same performance as the client?
UPDATE:
I didn't know I could get such detailed logs, thanks Anon Coward.
So I guess I was wrong, it was sending the correct filtered request.
I'm not sure how to read the log, but the various errors, retry requests, region redirects, DNS checks, etc. are all there. I don't think those should account for the extra 18 minutes, but I really have no idea. Maybe some sort of background overhead or preparation of the data so that it can be consumed? Versus the client, where there are no errors.
Also, using an S3 Inventory Report is not an option at the moment; being too slow and not real-time is one of the problems.
So does that mean the only option is to use the client? It seems as if the S3 resource has some sort of internal efficiency or overhead problem when dealing with a large number of objects, since it works fine (at the same speed as the client) with small numbers of objects.
I was hoping some settings change could make resource just as performant, but if it's deep in the internals then maybe it's a no-go. Which is a shame, considering how easy Resource is to use without needing to manage multiple calls with continuation tokens.
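For what it's worth, a paginator gives client-level behaviour without hand-managing continuation tokens (a sketch with the same placeholder names):
import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_objects_v2")      # the paginator handles the tokens
pages = paginator.paginate(
    Bucket="myBucket",
    Prefix="folder1/hello09_",
    PaginationConfig={"PageSize": 1000},
)
keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]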
botocore.endpoint [DEBUG] Sending http request: <AWSPreparedRequest stream_output=False, method=GET,
url=https://myBucket.s3.xxxx.amazonaws.com/?prefix=myFolder1%2Fhello09_&encoding-type=url,
headers={'User-Agent': b'Boto3/1.20.24 Python/3.8.0 Windows Botocore/1.27.59 Resource', 'X-Amz-Date': b'xxx',
'X-Amz-Content-SHA256': b'xxx', 'Authorization': b'xxx', 'amz-sdk-invocation-id': b'xxx', 'amz-sdk-request': b'attempt=1'}>
...
botocore.parsers [DEBUG] Response headers: ...
...
[DEBUG] Event needs-retry.s3.ListObjects: calling handler <botocore.retryhandler.RetryHandler object at 0x000000????>
botocore.retryhandler [DEBUG] No retry needed.
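For anyone who wants to capture this kind of log output themselves, boto3 can emit botocore's debug logging (a sketch; the bucket and prefix are the placeholders used above):
import logging
import boto3

boto3.set_stream_logger("botocore", logging.DEBUG)        # prints request/response debug lines
s3 = boto3.resource("s3")
for obj in s3.Bucket("myBucket").objects.filter(Prefix="folder1/hello09_").limit(10):
    print(obj.key)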
UPDATE2:
Just for some closure, I looked closely at the HTTP requests from both resource and client, and there are differences. Resource does not specify a list type (so it uses ListObjects v1) and uses markers instead of continuation tokens; maybe S3 calculating where to continue from contributes to the slowness, or maybe the list method itself does (especially if list_objects v1 is slower than v2 or harder to consume). Then there is what Anon Coward said: resource is inherently slower because it performs more API calls and creates millions of (unneeded?) objects.
# client
url=https://myBucket.s3.xxx.amazonaws.com/
?list-type=2&
prefix=myFolder1%2Fhello_&
continuation-token=xxx
encoding-type=url,
# resource
url=https://myBucket.s3.xxx.amazonaws.com/
?
prefix=myFolder1%2Fhello_&
marker=myFolder1%2Fhello_world001&
encoding-type=url,

Collect one cell from pyspark Dataframe failed [duplicate]

I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.
17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:726)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:755)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /XXX.XX.XXX.XX:36245 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
The reason for adding this configuration was the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o171.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Therefore, I increased maxResultSize to 2.5 Gb, but the Spark job fails anyway (the error shown above).
How to solve this issue?
It seems like the problem is that the amount of data you are trying to pull back to your driver is too large. Most likely you are using the collect method to retrieve all values from a DataFrame/RDD. The driver is a single process, and by collecting a DataFrame you are pulling all of the data you had distributed across the cluster back to one node. This defeats the purpose of distributing it! It only makes sense to do this after you have reduced the data down to a manageable amount.
You have two options:
If you really need to work with all that data, then you should keep it out on the executors. Use HDFS and Parquet to save the data in a distributed manner and use Spark methods to work with the data on the cluster instead of trying to collect it all back to one place.
If you really need to get the data back to the driver, you should examine whether you really need ALL of the data or not. If you only need summary statistics then compute that out on the executors before calling collect. Or if you only need the top 100 results, then only collect the top 100.
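For example, instead of collecting a whole DataFrame, reduce it on the executors first and only collect the small result (a PySpark sketch; the input path and column names are made up):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs:///some/large/input")    # placeholder for the large DataFrame

# Aggregate on the cluster; only the tiny summary comes back to the driver.
summary = df.groupBy("key").agg(F.count("*").alias("n"), F.avg("value").alias("avg_value"))
rows = summary.collect()                               # small, safe to collect

# Or collect only the top 100 rows:
top_100 = df.orderBy(F.col("value").desc()).limit(100).collect()

# If the full dataset is genuinely needed downstream, keep it distributed:
df.write.mode("overwrite").parquet("hdfs:///path/to/output")   # placeholder path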
Update:
There is another reason you can run into this error that is less obvious. Spark will try to send data back to the driver beyond just when you explicitly call collect. It will also send back accumulator results for each task if you are using accumulators, data for broadcast joins, and some small status data about each task. If you have LOTS of partitions (20k+ in my experience) you can sometimes see this error. This is a known issue with some improvements made, and more in the works.
The options for getting past this, if it is your issue, are as follows (see the sketch after the list):
Increase spark.driver.maxResultSize or set it to 0 for unlimited
If broadcast joins are the culprit, you can reduce spark.sql.autoBroadcastJoinThreshold to limit the size of broadcast join data
Reduce the number of partitions
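A sketch of how those knobs might be set when building the session (values are illustrative, not recommendations):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("driver-result-size-tuning")
         .config("spark.driver.maxResultSize", "0")                          # 0 = unlimited; use with care
         .config("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)   # 10 MB cap (or -1 to disable broadcast joins)
         .getOrCreate())

# Fewer partitions means fewer per-task status/accumulator payloads sent back to the driver.
df = spark.read.parquet("hdfs:///some/input").coalesce(2000)                 # placeholder path and count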
Cause: actions like RDD's collect() that send a big chunk of data back to the driver.
Solution:
set by SparkConf: conf.set("spark.driver.maxResultSize", "4g")
OR
set by spark-defaults.conf: spark.driver.maxResultSize 4g
OR
set when calling spark-submit: --conf spark.driver.maxResultSize=4g

No space left on device in Sagemaker model training

I'm using a custom algorithm shipped in a Docker image, running on a p2 instance with AWS SageMaker (a bit similar to https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).
At the end of the training process, I try to write my model to the output directory that is mounted via SageMaker (like in the tutorial), like this:
model_path = "/opt/ml/model"
model.save(os.path.join(model_path, 'model.h5'))
Unluckily, the model apparently gets too big over time, and I get the following error:
RuntimeError: Problems closing file (file write failed: time = Thu Jul
26 00:24:48 2018
00:24:49 , filename = 'model.h5', file descriptor = 22, errno = 28,
error message = 'No space left on device', buf = 0x1a41d7d0, total
write[...]
So all my hours of GPU time are wasted. How can I prevent this from happening again? Does anyone know what the size limit is for a model stored on the SageMaker-mounted directories?
When you train a model with Estimators, it defaults to 30 GB of storage, which may not be enough. You can use the train_volume_size param on the constructor to increase this value. Try with a large-ish number (like 100GB) and see how big your model is. In subsequent jobs, you can tune down the value to something closer to what you actually need.
Storage costs $0.14 per GB-month of provisioned storage. Partial usage is prorated, so giving yourself some extra room is a cheap insurance policy against running out of storage.
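A sketch of what that looks like with the SageMaker Python SDK (v1-era parameter names, matching train_volume_size above; the image, role, and paths are placeholders):
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-custom-algo:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    train_volume_size=100,                      # GB of EBS storage for /opt/ml during training
    output_path="s3://my-bucket/model-artifacts/",
)
estimator.fit("s3://my-bucket/training-data/")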
In the SageMaker Jupyter notebook, you can check free space on the filesystem(s) by running !df -h. For a specific path, try something like !df -h /opt.

Aerospike error: All batch queues are full

I am running an Aerospike cluster in Google Cloud. Following the recommendation in this post, I updated to the latest version (3.11.1.1) and re-created all servers. In fact, this change caused my 5 servers to operate at a much lower CPU load (it was around 75% before; now it is around 20%, as shown in the graph below):
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, which recommends changing the parameters batch-index-threads and batch-max-unused-buffers with the command:
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many combinations of values for batch-index-threads (2, 4, 8, 16), and also changed the batch-max-unused-buffers param, but nothing solved the problem. I keep receiving the All batch queues are full error.
Here is the relevant part of my aerospike.conf:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-recovery-policy auto-reset-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 32
    transaction-queues 32
    transaction-threads-per-queue 4
    batch-index-threads 40
    proto-fd-max 15000
    batch-max-requests 30000
    replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.
A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
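To illustrate the difference (a sketch with the Aerospike Python client; the host, namespace, and set names are made up):
import aerospike

client = aerospike.client({"hosts": [("10.0.0.1", 3000)]}).connect()

# Single-record request: a plain get does not touch the batch subsystem.
key = ("test", "demo", "user-123")
(_, _, record) = client.get(key)

# Batch calls are for genuinely multi-record requests; these are what land on the batch queues.
keys = [("test", "demo", "user-%d" % i) for i in range(100)]
records = client.get_many(keys)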
It's possible that you are being constrained by the network between the clients and servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node will start cluster migrations which adds additional network load.
I would look at the batch-max-buffer-per-queue config parameter.
Maximum number of 128KB response buffers allowed in each batch index
queue. If all batch index queues are full, new batch requests are
rejected.
In conjunction with raising this value from its default of 255, you will also want to raise batch-max-unused-buffers to at least batch-index-threads x batch-max-buffer-per-queue + 1. If you do not, new buffers will be created and destroyed constantly, because the number of free (unused) buffers is smaller than the number you are using: the moment a batch response is served, the system trims the buffers back down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example if you raise the batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance, batch-max-unused-buffers should then be set to 13000, which gives a maximum memory consumption of 1625 MB (1.59 GB) per node.
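The same arithmetic as a quick script, for sanity-checking other values (a sketch; plug in your own settings):
# Batch-index buffer memory estimate, assuming the 128 KB buffer size described above.
batch_index_threads = 40
batch_max_buffer_per_queue = 320
batch_max_unused_buffers = 13000
buffer_kb = 128

max_in_use_mb = batch_index_threads * batch_max_buffer_per_queue * buffer_kb / 1024.0
unused_pool_mb = batch_max_unused_buffers * buffer_kb / 1024.0

print("max in-use buffer memory: %.0f MB per node" % max_in_use_mb)   # 1600 MB
print("unused-buffer pool cap:   %.0f MB per node" % unused_pool_mb)  # 1625 MB

# Rule of thumb from above: batch-max-unused-buffers >= batch-index-threads * batch-max-buffer-per-queue + 1
assert batch_max_unused_buffers >= batch_index_threads * batch_max_buffer_per_queue + 1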

Is there a maximum concurrency for AWS s3 multipart uploads?

Referring to the docs, you can specify the number of concurrent connections when pushing large files to Amazon Web Services S3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether the size of each chunk is derived from the total file size / concurrency.
I trawled through the source code, and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined by the way, this is just condensed for example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(30)
->setOption('CacheControl', 'max-age=3600')
->build();
Works great, except that a 200 MB file takes 9 minutes to upload... with 30 concurrent connections? That seemed suspicious, so I upped the concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be the connection and not the code.
So my question is whether or not there's a concurrency maximum, what it is, and whether you can specify the size of the chunks or whether chunk size is calculated automatically. My goal is to get a 500 MB file transferred to AWS S3 within 5 minutes, so I have to optimize this if possible.
Looking through the source code, it looks like 10,000 is the maximum, which matches S3's limit of 10,000 parts per multipart upload. There is no automatic calculation of chunk size based on the number of concurrent connections, but you can set the part size yourself if needed for whatever reason.
I set the chunk size to 10 MB with 20 concurrent connections and it seems to work fine. On a real server I got a 100 MB file to transfer in 23 seconds, much better than the 3.5 to 4 minutes it was taking in the dev environments. Interesting, but those are the stats, should anyone else come across this same issue.
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource($file)
->setBucket($bucket)
->setKey($file)
->setConcurrency(20)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
I may need to bump that cache max-age later, but as of yet this works acceptably. The key was moving the processing code to the server and not relying on my weak dev environments, no matter how powerful the machine or how fast the internet connection.
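A quick back-of-envelope on part count and throughput may also help (a sketch; the 5 MB minimum part size and 10,000-part maximum are documented S3 limits, the other numbers come from this thread):
MIN_PART_SIZE = 5 * 1024 * 1024          # S3 minimum part size (except the last part)
MAX_PARTS = 10000                        # S3 maximum number of parts per upload

file_size = 500 * 1024 * 1024            # the 500 MB target file
part_size = 10 * 1024 * 1024             # the 10 MB parts used above
num_parts = -(-file_size // part_size)   # ceiling division -> 50 parts

# Concurrency beyond the part count buys nothing: at most num_parts uploads can run at once.
effective_concurrency = min(20, num_parts)

# Aggregate throughput needed to hit a 5-minute target.
target_seconds = 5 * 60
required_mbit_per_s = file_size * 8 / target_seconds / 1e6
print(num_parts, effective_concurrency, round(required_mbit_per_s, 1))   # 50 20 14.0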
You can abort the upload at any point during the process, halting all in-flight operations. You can also set the concurrency and the minimum part size:
$uploader = UploadBuilder::newInstance()
->setClient($client)
->setSource('/path/to/large/file.mov')
->setBucket('mybucket')
->setKey('my-object-key')
->setConcurrency(3)
->setMinPartSize(10485760)
->setOption('CacheControl', 'max-age=3600')
->build();
try {
$uploader->upload();
echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
$uploader->abort();
echo "Upload failed.\n";
}