Unable to use spark kryo serialization - apache-spark-2.0

I am unable to user the kryo serializer in spark-2.0.2. In my scala driver code, I have.
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
parkConf.set("spark.kryoserializer.buffer.max","64m")
parkConf.set("spark.kryoserializer.buffer","64k")
However, this generates the following error:
[Stage 0:> (0 + 1) / 4]17/03/30 10:15:34 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 157. To avoid this, increase spark.kryoserializer.buffer.max value.
In addition, I tried setting the same properties in spark-defaults.conf, with the same error. Give the error is reporting that the "Available" size is 0:, it would seem that my settings are being ignored.

I now understand. "spark.kryoserializer.buffer.max" must be big enough to accept all the data in the partition, not just a record. For a partition containing 512mb of 256 byte arrays, the buffer.max must be on the order of 768mb. I didn't see this explained anywhere in the docs, and was under the impression that buffer.max had to be big enough to accept the largest serialized record in the partition.

Related

Collect one cell from pyspark Dataframe failed [duplicate]

I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command.
17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed
17/12/27 18:33:19 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:726)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:755)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:755)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Connection from /XXX.XX.XXX.XX:36245 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
The reason of adding this configuration was the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o171.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Therefore, I increased maxResultSize to 2.5 Gb, but the Spark job fails anyway (the error shown above).
How to solve this issue?
It seems like the problem is the amount of data you are trying to pull back to to your driver is too large. Most likely you are using the collect method to retrieve all values from a DataFrame/RDD. The driver is a single process and by collecting a DataFrame you are pulling all of that data you had distributed across the cluster back to one node. This defeats the purpose of distributing it! It only makes sense to do this after you have reduced the data down to a manageable amount.
You have two options:
If you really need to work with all that data, then you should keep it out on the executors. Use HDFS and Parquet to save the data in a distributed manner and use Spark methods to work with the data on the cluster instead of trying to collect it all back to one place.
If you really need to get the data back to the driver, you should examine whether you really need ALL of the data or not. If you only need summary statistics then compute that out on the executors before calling collect. Or if you only need the top 100 results, then only collect the top 100.
Update:
There is another reason you can run into this error that is less obvious. Spark will try to send data back the driver beyond just when you explicitly call collect. It will also send back accumulator results for each task if you are using accumulators, data for broadcast joins, and some small status data about each task. If you have LOTS of partitions (20k+ in my experience) you can sometimes see this error. This is a known issue with some improvements made, and more in the works.
The options for getting past if if this is your issue are:
Increase spark.driver.maxResultSize or set it to 0 for unlimited
If broadcast joins are the culprit, you can reduce spark.sql.autoBroadcastJoinThreshold to limit the size of broadcast join data
Reduce the number of partitions
Cause: caused by actions like RDD's collect() that send big chunk of data to the driver
Solution:
set by SparkConf: conf.set("spark.driver.maxResultSize", "4g")
OR
set by spark-defaults.conf: spark.driver.maxResultSize 4g
OR
set when calling spark-submit: --conf spark.driver.maxResultSize=4g

OutOfMemory error when writing to s3a through EMR

Getting an OutOfMemory error for the following PySpark code: (fails after a certain number of rows are written. This does not happen if I attempt to write to the hadoop filesystem instead of using s3a, so I think I've narrowed it down to the problem being s3a. ) - end goal to write to s3a.
Was wondering if there was an optimal s3a configuration where I will not run out of memory for extremely large tables.
df = spark.sql("SELECT * FROM my_big_table")
df.repartition(1).write.option("header", "true").csv("s3a://mycsvlocation/folder/")
my s3a configurations (emr default) :
('fs.s3a.attempts.maximum', '10')
('fs.s3a.buffer.dir', '${hadoop.tmp.dir}/s3a')
('fs.s3a.connection.establish.timeout', '5000')
('fs.s3a.connection.maximum', '15')
('fs.s3a.connection.ssl.enabled', 'true')
('fs.s3a.connection.timeout', '50000')
('fs.s3a.fast.buffer.size', '1048576')
('fs.s3a.fast.upload', 'true')
('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
('fs.s3a.max.total.tasks', '1000')
('fs.s3a.multipart.purge', 'false')
('fs.s3a.multipart.purge.age', '86400')
('fs.s3a.multipart.size', '104857600')
('fs.s3a.multipart.threshold', '2147483647')
('fs.s3a.paging.maximum', '5000')
('fs.s3a.threads.core', '15')
('fs.s3a.threads.keepalivetime', '60')
('fs.s3a.threads.max', '256')
('mapreduce.fileoutputcommitter.algorithm.version', '2')
('spark.authenticate', 'true')
('spark.network.crypto.enabled', 'true')
('spark.network.crypto.saslFallback', 'true')
('spark.speculation', 'false')
base of the stack trace:
Caused by: java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.hadoop.fs.s3a.S3AFastOutputStream.write(S3AFastOutputStream.java:194)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:60)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
at com.univocity.parsers.common.input.WriterCharAppender.writeCharsAndReset(WriterCharAppender.java:152)
at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:808)
... 16 more
The problem here is that the default s3a upload does not support uploading of a singular large file greater than 2GB or 2147483647 bytes.
('fs.s3a.multipart.threshold', '2147483647')
My EMR version is older than the more recent ones so the multipart.threshold parameter is only an integer, thus the limit is 2147483647 bytes, for a single "part" or file. The more recent versions use long instead of int and can support a larger single file size limit.
I'll be using a work around write the file to local hdfs then moving it to s3 via a separate java program.

How to solve stability problems in Google Dataflow

I have a Dataflow job that has been running stable for several months.
The last 3 days or so, I've problems with the job, it's getting stuck after a certain amount of time and the only thing I can do is stop the job and start a new one. This happened after 2, 6 and 24 hours of processing. Here is the latest exception:
java.lang.ExceptionInInitializerError
at org.apache.beam.runners.dataflow.worker.options.StreamingDataflowWorkerOptions$WindmillServerStubFactory.create (StreamingDataflowWorkerOptions.java:183)
at org.apache.beam.runners.dataflow.worker.options.StreamingDataflowWorkerOptions$WindmillServerStubFactory.create (StreamingDataflowWorkerOptions.java:169)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper (ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault (ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke (ProxyInvocationHandler.java:158)
at com.sun.proxy.$Proxy54.getWindmillServerStub (Unknown Source)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.<init> (StreamingDataflowWorker.java:677)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.fromDataflowWorkerHarnessOptions (StreamingDataflowWorker.java:562)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.main (StreamingDataflowWorker.java:274)
Caused by: java.lang.RuntimeException: Loading windmill_service failed:
at org.apache.beam.runners.dataflow.worker.windmill.WindmillServer.<clinit> (WindmillServer.java:42)
Caused by: java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0 (Native Method)
at sun.nio.ch.FileDispatcherImpl.write (FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer (IOUtil.java:93)
at sun.nio.ch.IOUtil.write (IOUtil.java:65)
at sun.nio.ch.FileChannelImpl.write (FileChannelImpl.java:211)
at java.nio.channels.Channels.writeFullyImpl (Channels.java:78)
at java.nio.channels.Channels.writeFully (Channels.java:101)
at java.nio.channels.Channels.access$000 (Channels.java:61)
at java.nio.channels.Channels$1.write (Channels.java:174)
at java.nio.file.Files.copy (Files.java:2909)
at java.nio.file.Files.copy (Files.java:3027)
at org.apache.beam.runners.dataflow.worker.windmill.WindmillServer.<clinit> (WindmillServer.java:39)
Seems like there is no space left on a device, but shouldn't this be managed by Google? Or is this an error in my job somehow?
UPDATE:
The workflow is as follows:
Reading mass data from PubSub (up to 1500/s)
Filter some messages
Keeping session window on key and grouping by it
Sort the data and do calculations
Output the data to another PubSub
You can increase the storage capacity in the parameter of your pipelise. Look at this one diskSizeGb in this page
In addition, more you keep data in memory, more you need memory. It's the case for the windows, if you never close them, or if you allow late data for too long time, you need a lot of memory to keep all these data up.
Tune either your pipeline, or your machine type. Or both!

'Premature end of Content-Length' with Spark Application using s3a

I'm writing a Spark based application which works around a pretty huge data stored on s3. It's about 15 TB in size uncompressed. Data is laid across multiple small LZO compressed files files, varying from 10-100MB.
By default the job spawns 130k tasks while reading dataset and mapping it to schema.
And then it fails around 70k tasks completions and after ~20 tasks failure.
Exception:
WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body
Looks like the s3 connection is getting closed prematurely.
I have tried nearly 40 different combos of configurations.
To summarize them: 1 executor to 3 executors per node, 18GB to 42GB --executor-memory, 3-5 --executor-cores, 1.8GB-4.0 GB spark.yarn.executor.memoryOverhead, Both, Kryo and Default Java serializers, 0.5 to 0.35 spark.memory.storageFraction, default, 130000 to 200000 partitions for bigger dataset. default, 200 to 2001 spark.sql.shuffle.partitions.
And most importantly: 100 to 2048 fs.s3a.connection.maximum property.
[This seems to be most relevant property to exception.]
[In all cases, driver was set to memory = 51GB, cores = 12, MEMORY_AND_DISK_SER level for caching]
Nothing worked!
If I run the program with half of the bigger dataset size (7.5TB), it finishes successfully in 1.5 hr.
What could I be doing wrong?
How do I determine the optimal value for fs.s3a.connection.maximum?
Is it possible that the s3 clients are getting GCed?
Any help will be appreciated!
Environment:
AWS EMR 5.7.0, 60 x i2.2xlarge SPOT Instances (16 vCPU, 61GB RAM, 2 x 800GB SSD), Spark 2.1.0
YARN is used as resource manager.
Code:
It's a fairly simple job, doing something like this:
val sl = StorageLevel.MEMORY_AND_DISK_SER
sparkSession.sparkContext.hadoopConfiguration.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.setInt("fs.s3a.connection.maximum", 1200)
val dataset_1: DataFrame = sparkSession
.read
.format("csv")
.option("delimiter", ",")
.schema(<schema: StructType>)
.csv("s3a://...")
.select("ID") //15 TB
dataset_1.persist(sl)
print(dataset_1.count())
tmp = dataset_1.groupBy(“ID”).agg(count("*").alias("count_id”))
tmp2 = tmp.groupBy("count_id").agg(count("*").alias(“count_count_id”))
tmp2.write.csv(…)
dataset_1.unpersist()
Full Stacktrace:
17/08/21 20:02:36 INFO compress.CodecPool: Got brand-new decompressor [.lzo]
17/08/21 20:06:18 WARN lzo.LzopInputStream: IOException in getCompressedData; likely LZO corruption.
org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited message body (expected: 79627927; received: 19388396
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:180)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:108)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
at com.amazonaws.services.s3.model.S3ObjectInputStream.read(S3ObjectInputStream.java:155)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:160)
at java.io.DataInputStream.read(DataInputStream.java:149)
at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:73)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:321)
at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:261)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:186)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.hasNext(HadoopFileLinesReader.scala:50)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:99)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:91)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:364)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1021)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:936)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:996)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:700)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
EDIT: We have another service which consume exactly same logs, it works just fine. But it uses old "s3://" scheme and is based on Spark-1.6. I'll try using "s3://" instead of "s3a://".

Aerospike error: All batch queues are full

I am running an Aerospike cluster in Google Cloud. Following the recommendation on this post, I updated to the last version (3.11.1.1) and re-created all servers. In fact, this change cause my 5 servers to operate in a much lower CPU load (it was around 75% load before, now it is on 20%, as show in the graph bellow:
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, recommending to change the parameters batch-index-threads and batch-max-unused-buffers with the command
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many combinations of values (batch-index-threads with 2,4,8,16) and none of them solved the problem, and also changing the batch-index-threads param. Nothing solves my problem. I keep receiving the All batch queues are full error.
Here is my aerospace.conf relevant information:
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
paxos-recovery-policy auto-reset-master
pidfile /var/run/aerospike/asd.pid
service-threads 32
transaction-queues 32
transaction-threads-per-queue 4
batch-index-threads 40
proto-fd-max 15000
batch-max-requests 30000
replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.
A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
It's possible that you are being constrained by the network between the clients and servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node will start cluster migrations which adds additional network load.
I would look at the batch-max-buffer-per-queue config parameter.
Maximum number of 128KB response buffers allowed in each batch index
queue. If all batch index queues are full, new batch requests are
rejected.
In conjunction with raising this value from the default of 255 you will want to also raise the batch-max-unused-buffers to batch-index-threads x batch-max-buffer-per-queue + 1 (at least). If you do not do that new buffers will be created and destroyed constantly, as the amount of free (unused) buffers is smaller than the ones you're using. The moment the batch response is served the system will strive to trim the buffers down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example if you raise the batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance the batch-max-unused-buffers should be set to 13000 which will have a max memory consumption of 1625MB (1.59GB) per-node.